The Particle Filter
I've yet to make a post on the EKF or UKF. If you have a solid foundation in linear algebra and calculus, they're relatively simple extensions of the Kalman Filter. Regardless, I do hope to make a post on them soon along with some tracking demos. In this post, I'll explore a completely different kind of model: one which doesn't rely on the heavy assumptions of linear dynamics/measurement models and Gaussian distributions. Nevertheless, all of these algorithms are meant to provide a strategy for estimating a hidden state when our dynamics and measurements are non-linear functions of that state.
The Goal of the EKF, UKF, and Particle Filter
Say we have some system with the following properties:

$$x_t = f(x_{t-1}, u_t) + w_t$$
$$z_t = h(x_t) + v_t$$

Here $f$ is the (possibly non-linear) dynamics function, $h$ is the measurement function, $u_t$ is a control input, and $w_t$, $v_t$ are process and measurement noise.
Due to the stochasticity in both propagation and measurements, both the next state and the observation can be viewed (more intuitively) as samples from probability distributions:

$$x_t \sim p(x_t \mid x_{t-1}, u_t)$$
$$z_t \sim p(z_t \mid x_t)$$

The goal of all three filters is to estimate the posterior over the hidden state given everything observed so far, $p(x_t \mid z_{1:t})$.
To calculate this posterior exactly using the equations from the Kalman Filter, the noise must be Gaussian and the functions must be linear (e.g. $x_t = A x_{t-1} + B u_t + w_t$). These assumptions almost never hold in real-world scenarios, so the EKF and UKF give us strategies to act as if the functions are linear and the noise is Gaussian, using linearization and the unscented transform, respectively.
The particle filter breaks from this paradigm entirely. It doesn't assume linearity or Gaussianity. Instead, it represents the posterior over states using a cloud of weighted samples and evolves them over time using propagation and importance sampling.
Monte Carlo Approximation
The basic idea of a Monte Carlo approximation is to use samples from a PDF to compute statistics instead of brute-force approaches. An example might be to compute the expected value of a distribution:

$$\mathbb{E}_{p}[x] = \int x\, p(x)\, dx$$
In many real-world problems there is no analytic solution to this integral, and computing it over the entire space of inputs is impossible. Instead, we can approximate the expectation by drawing samples from $p(x)$ and averaging them. The number of samples that land at different values of $x$ naturally reflects the probability density, so increasing the number of samples improves our approximation.

$$\mathbb{E}_{p}[x] \approx \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$$

where $x^{(i)}$ are drawn from $p(x)$.
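To make this concrete, here's a minimal Python sketch (my own toy example, not from the original post): we estimate the mean of a Gamma distribution, whose true value we can check analytically, just by averaging samples.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy target p(x): a Gamma distribution with shape k=2 and scale=3.
# Its true mean is k * scale = 6, so we can check the estimate.
N = 100_000
samples = rng.gamma(shape=2.0, scale=3.0, size=N)  # x^(i) ~ p(x)

mc_estimate = np.mean(samples)  # Monte Carlo estimate of E_p[x]
print(mc_estimate)              # ~6.0, and it improves as N grows
```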
Importance Sampling
But what if we can't sample from $p(x)$? This is where the idea of importance sampling comes into play. We can perform the following mathematical trick to approximate quantities related to $p(x)$ by sampling from another distribution $q(x)$ and then weighting those samples. It's easier to show you:

$$\mathbb{E}_{p}[x] = \int x\, p(x)\, dx = \int x\, \frac{p(x)}{q(x)}\, q(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} \frac{p(x^{(i)})}{q(x^{(i)})}\, x^{(i)}, \qquad x^{(i)} \sim q(x)$$
We call the quantity $w^{(i)} = \frac{p(x^{(i)})}{q(x^{(i)})}$ a weight. Therefore, our expectation can really be computed by sampling another distribution, but then weighting the samples according to their "mismatch" with the true distribution we want to sample from. It's a lot easier to evaluate the posterior density at a particular state than it is to sample from the entire distribution. I'll show you why computing the posterior distribution exactly is intractable right after this, but just know that importance sampling is incredibly powerful and underlies the particle filter.
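Here's the same kind of toy sketch for importance sampling (again my own example, not from the post): we estimate the mean of $p(x) = \mathcal{N}(3, 1)$, pretending we can only evaluate its density, by sampling from a wider proposal $q(x) = \mathcal{N}(0, 4^2)$ and weighting. The self-normalized form (dividing by the sum of the weights) is the one the particle filter uses.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Target p(x) = N(3, 1): we can evaluate it, but pretend we can't sample it.
# Proposal q(x) = N(0, 4^2): easy to sample and covers the target's support.
N = 100_000
x = rng.normal(loc=0.0, scale=4.0, size=N)                   # x^(i) ~ q(x)
w = gaussian_pdf(x, 3.0, 1.0) / gaussian_pdf(x, 0.0, 4.0)    # w^(i) = p/q

# Self-normalized importance sampling estimate of E_p[x] (true value: 3).
estimate = np.sum(w * x) / np.sum(w)
print(estimate)
```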
Particles
In a particle filter, we maintain a list of paired weights and samples. For now, I will rewrite the system without control inputs for simplicity's sake:

$$x_t \sim p(x_t \mid x_{t-1})$$
$$z_t \sim p(z_t \mid x_t)$$
Our goal is to use a set of particles $\{(x_t^{(i)}, w_t^{(i)})\}_{i=1}^{N}$ to approximate the posterior:

$$p(x_t \mid z_{1:t}) \approx \sum_{i=1}^{N} w_t^{(i)}\, \delta(x_t - x_t^{(i)})$$
The brute-force Bayes' Rule computation is written as follows:

$$p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}$$
Expanding $p(x_t \mid z_{1:t-1})$, and likewise the normalizing constant in the denominator:

$$p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1}$$
$$p(z_t \mid z_{1:t-1}) = \int p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})\, dx_t$$
Computing the numerator requires an integral over the entire state space, which is intractable, and the denominator requires a double integral over previous and current states, which is doubly intractable.
In the particle filter, we have this set of particles, each of which represents a hypothesis about where the state could be: a potential instantiation of the stochastic process. The weight associated with a particle represents how likely that hypothesis is, and, in this way, the set of particles and weights represents the true distribution. It will help to develop this framework more rigorously by looking at a specific predict-update cycle.
Let's assume we have two lists, the particles $\{x_{t-1}^{(i)}\}$ and their weights $\{w_{t-1}^{(i)}\}$, representing the approximation of the posterior from the previous step, $p(x_{t-1} \mid z_{1:t-1})$. First, let's compute the update. We really compute the probabilities through the weights, as the particles themselves carry the actual value of the hidden state.
The weight of a particle has the following form, which comes from importance sampling:

$$w_t^{(i)} \propto \frac{p(x_{0:t}^{(i)} \mid z_{1:t})}{q(x_{0:t}^{(i)} \mid z_{1:t})}$$
As you can see, the individual particles in the particle filter are really approximating the distribution of the entire trajectory $x_{0:t}$ conditioned on the observations. To get the probability of a certain state, we sum the weights of all the particles with that hidden state:

$$p(x_t = x \mid z_{1:t}) \approx \sum_{i\,:\,x_t^{(i)} = x} w_t^{(i)}$$
Another approach would be the MAP, where we just take the particle with the highest weight at each timestep.
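As a small illustration (the function names are mine), here is how you might pull point estimates out of a normalized particle set: a weighted mean over the particles, and the MAP-style estimate that just takes the highest-weight particle.

```python
import numpy as np

def posterior_mean(particles, weights):
    """Weighted average of the particle states (per state dimension)."""
    return np.average(particles, axis=0, weights=weights)

def map_estimate(particles, weights):
    """MAP-style estimate: the single highest-weight particle."""
    return particles[np.argmax(weights)]

# Example: five 1-D particles with normalized weights.
particles = np.array([[0.1], [0.4], [0.5], [0.9], [1.3]])
weights = np.array([0.05, 0.20, 0.50, 0.20, 0.05])
print(posterior_mean(particles, weights))  # [0.58]
print(map_estimate(particles, weights))    # [0.5]
```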
Returning to our original weight formula, let's first figure out how to make this definition recursive. Due to the Markov property, we can decompose both the numerator and the denominator. The numerator factors like so:

$$p(x_{0:t} \mid z_{1:t}) \propto p(z_t \mid x_t)\, p(x_t \mid x_{t-1})\, p(x_{0:t-1} \mid z_{1:t-1})$$
And the denominator:

$$q(x_{0:t} \mid z_{1:t}) = q(x_t \mid x_{0:t-1}, z_{1:t})\, q(x_{0:t-1} \mid z_{1:t-1})$$
Notice that in both factorizations, the term involving the previous trajectory is the corresponding piece of the previous weight. This means the weights can be updated recursively:

$$w_t^{(i)} \propto w_{t-1}^{(i)}\, \frac{p(z_t \mid x_t^{(i)})\, p(x_t^{(i)} \mid x_{t-1}^{(i)})}{q(x_t^{(i)} \mid x_{0:t-1}^{(i)}, z_{1:t})}$$
We already have $x_{0:t-1}^{(i)}$; those are our particles from the previous timestep. The $x_t^{(i)}$ are sampled from whatever our choice of proposal $q$ is. There are many different choices we could use, but a common one is known as the "bootstrap filter". In this case, we choose to make the proposal distribution equal to the transition model:

$$q(x_t \mid x_{0:t-1}, z_{1:t}) = p(x_t \mid x_{t-1})$$
This simplifies our weight update massively:

$$w_t^{(i)} \propto w_{t-1}^{(i)}\, p(z_t \mid x_t^{(i)})$$
So the particles come from propagating the previous points through our motion model. Then we just evaluate the likelihood and multiply this by our previous weights. Putting things into log space turns multiplication into sums and that's a bit nicer to work with.
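Here's a minimal sketch of one predict-update cycle of a bootstrap filter, using log weights as suggested above. The toy model (a 1-D random walk with a direct noisy observation) and all the names are my own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def bootstrap_step(particles, log_weights, z, q_std=0.5, r_std=1.0):
    """One predict-update cycle for the toy model
       x_t = x_{t-1} + w_t,  w_t ~ N(0, q_std^2)   (transition)
       z_t = x_t + v_t,      v_t ~ N(0, r_std^2)   (observation)."""
    # Predict: sample new particles from the transition model (the proposal).
    particles = particles + rng.normal(0.0, q_std, size=particles.shape)

    # Update: add the Gaussian observation log-likelihood to each log weight
    # (up to an additive constant, which cancels when we normalize).
    log_weights = log_weights - 0.5 * ((z - particles) / r_std) ** 2

    # Normalize in log space so the weights sum to one.
    log_weights = log_weights - np.max(log_weights)
    log_weights = log_weights - np.log(np.sum(np.exp(log_weights)))
    return particles, log_weights

# Start with particles spread around zero and uniform weights.
N = 1000
particles = rng.normal(0.0, 2.0, size=N)
log_weights = np.full(N, -np.log(N))

particles, log_weights = bootstrap_step(particles, log_weights, z=1.2)
print(np.sum(np.exp(log_weights) * particles))  # weighted posterior mean
```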
What if we chose a different proposal? What if it was, for example, an Extended Kalman Filter (EKF)-based proposal distribution?
In this case, the proposal is no longer equal to the transition model, and the denominator no longer cancels with the numerator. The EKF gives us a Gaussian approximation to the posterior $p(x_t \mid x_{t-1}^{(i)}, z_t)$, which we use as the proposal to draw samples $x_t^{(i)} \sim q(x_t \mid x_{t-1}^{(i)}, z_t)$.
This gives us potentially better samples, since the proposal incorporates the latest observation $z_t$ and particles are more likely to land in high-likelihood regions, but it comes at a cost. The weight update becomes:

$$w_t^{(i)} \propto w_{t-1}^{(i)}\, \frac{p(z_t \mid x_t^{(i)})\, p(x_t^{(i)} \mid x_{t-1}^{(i)})}{q(x_t^{(i)} \mid x_{t-1}^{(i)}, z_t)}$$
We now have to evaluate all three terms explicitly:
- The transition probability $p(x_t^{(i)} \mid x_{t-1}^{(i)})$
- The observation likelihood $p(z_t \mid x_t^{(i)})$
- The proposal probability $q(x_t^{(i)} \mid x_{t-1}^{(i)}, z_t)$, which is typically a Gaussian density evaluated at $x_t^{(i)}$ with mean and covariance from the EKF update
So while we may gain efficiency in sampling by focusing particles in the right areas, we pay in computation by needing to calculate the full importance weight. In practice, this often leads to better performance in problems where observations provide strong information and the dynamics are noisy or uncertain.
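For contrast with the bootstrap case, here is a sketch of the full log-weight update for a single particle when the proposal is some Gaussian that differs from the transition model (as an EKF-based proposal would be). It reuses the toy 1-D model from the bootstrap sketch above; `prop_mean` and `prop_std` stand in for whatever mean and standard deviation an EKF update would produce.

```python
import numpy as np
from scipy.stats import norm

def general_log_weight_update(log_w_prev, x_prev, x_new, z,
                              prop_mean, prop_std, q_std=0.5, r_std=1.0):
    """Full importance-weight update (in log space) for the toy model
       x_t = x_{t-1} + w_t, z_t = x_t + v_t, with a Gaussian proposal."""
    log_transition = norm.logpdf(x_new, loc=x_prev, scale=q_std)      # p(x_t | x_{t-1})
    log_likelihood = norm.logpdf(z, loc=x_new, scale=r_std)           # p(z_t | x_t)
    log_proposal = norm.logpdf(x_new, loc=prop_mean, scale=prop_std)  # q(x_t | x_{t-1}, z_t)
    return log_w_prev + log_transition + log_likelihood - log_proposal
```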
Finally, we have to make sure the weights are normalized so they form a valid probability distribution over particles:

$$\tilde{w}_t^{(i)} = \frac{w_t^{(i)}}{\sum_{j=1}^{N} w_t^{(j)}}$$
This ensures that:

$$\sum_{i=1}^{N} \tilde{w}_t^{(i)} = 1$$
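In log space, this normalization is usually done with a log-sum-exp, which avoids underflow when the unnormalized weights are tiny. A small sketch:

```python
import numpy as np
from scipy.special import logsumexp

def normalize_log_weights(log_w):
    """Subtract logsumexp so that exp(log_w) sums to one, without
    underflowing when the unnormalized weights are very small."""
    return log_w - logsumexp(log_w)

log_w = np.log(np.array([1e-12, 3e-12, 6e-12]))
print(np.exp(normalize_log_weights(log_w)))  # [0.1 0.3 0.6]
```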
Weight Degeneracy
A common problem, particularly with the bootstrap filter, is weight degeneracy, where most weights are near zero and only a few particles carry most of the probability. This happens frequently with the bootstrap filter because the proposal does not consider the observation at all, so the observation likelihood in the weight update may be high for only a few particles, causing huge variance in the weights.
This is a huge problem because only a few particles are accurately tracking the state. We might be missing certain modes and are likely getting a bad approximation. A common fix for this is resampling: after each observation update we delete the low-weight particles and duplicate the high-weight particles in their place. This gives us a new set of equally weighted particles for the next predict-update cycle.
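As a concrete (if simplistic) sketch of that idea, here is multinomial resampling, which just draws a fresh set of particles with replacement in proportion to the weights; the function name is my own.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def multinomial_resample(particles, weights):
    """Draw N new particles with replacement, each chosen with probability
    equal to its weight, then reset every weight to 1/N."""
    N = len(particles)
    idx = rng.choice(N, size=N, replace=True, p=weights)
    return particles[idx], np.full(N, 1.0 / N)
```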
I won't go deeper into resampling strategies here, but they're pretty straightforward. The intuition is pretty clear from the picture below: