The Bayes Filter in Different Contexts
Bayes Filter
The Bayes Filter is a general algorithm for state estimation. A Kalman Filter, for example, is a specific implementation of the Bayes Filter and may be the most important algorithm in robotics. Because the Bayes Filter is so general, I initially found it difficult to grasp its different applications. Studying the implementation in different contexts helped me quite a bit. In this post, I'll cover two different scenarios where a Bayes Filter is useful: state estimation with a Hidden Markov Model (beginning with a regular Markov Model) and with an "Action Model".
Markov Model - Transitioning between states without observation
Our system can be in one of two states. The probability of transitioning from one state to the other is summarized in a transition matrix $A$, where entry $A_{ij} = p(x_{t+1} = j \mid x_t = i)$:
$$
A \;=\; \begin{bmatrix}
p(x_{t+1} = 1 \mid x_t = 1) & p(x_{t+1} = 2 \mid x_t = 1)\\
p(x_{t+1} = 1 \mid x_t = 2) & p(x_{t+1} = 2 \mid x_t = 2)
\end{bmatrix}.
$$
If we're certain about the starting probability distribution (we know what state we're in), then the starting belief vector $\pi_0$ (where $\pi_t(i) = p(x_t = i)$) is one-hot, e.g. $\pi_0 = [1, 0]^\top$ if we start in state 1. We need to use the law of total probability to compute the probability of the next state, $p(x_1 = 1)$ or $p(x_1 = 2)$:
$$
p(x_1 = j) \;=\; \sum_{i} p(x_1 = j \mid x_0 = i)\; p(x_0 = i) \;=\; \sum_{i} A_{ij}\; \pi_0(i).
$$
Equivalently:
$$
\pi_1(j) \;=\; \pi_0 \cdot A_{:,\,j}.
$$
This means $\pi_1(j)$ is just the dot product of $\pi_0$ and the $j$-th column of $A$. Therefore, in general terms, we can compute $\pi_{t+1}$ by taking the dot product of $A^\top$ and $\pi_t$, where $A^\top$ is the transpose of $A$:
$$
\pi_{t+1} \;=\; A^\top \pi_t.
$$
As $t \to \infty$, the distribution stops changing, which gives us the steady-state distribution of the system. We can approximate the steady-state distribution by running the transition propagation many times, or exploit the fact that the distribution won't change, as follows:
$$
\pi \;=\; A^\top \pi.
$$
Therefore, the steady-state distribution is just the eigenvector of $A^\top$ for the eigenvalue $\lambda = 1$, normalized so its entries sum to one.
Note that this is just a description of how the state would evolve without any observations that might help us refine our judgment. I haven't yet introduced the "hidden" aspect which lets us exploit observations; so far it's just a Markov Model. At this point, the distribution over the state at any timestep can be fully described by the starting distribution $\pi_0$ (which may be initialized using the steady-state distribution) and the transition matrix:
$$
p(x_t = j \mid \pi_0) \;=\; \big[(A^\top)^t\, \pi_0\big]_j.
$$
By computing this value for every state $j$, we get the probability distribution of the state at time $t$. This is what we want.
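Here's a quick numerical sanity check of both approaches, with a made-up two-state transition matrix (the specific numbers are stand-ins, not anything meaningful):

```python
import numpy as np

# Illustrative transition matrix: A[i, j] = p(x_{t+1} = j | x_t = i), each row sums to one.
A = np.array([[0.9, 0.1],
              [0.3, 0.7]])

pi = np.array([1.0, 0.0])   # start certain that we're in state 0

# Propagate the belief forward: pi_{t+1} = A^T pi_t
for _ in range(100):
    pi = A.T @ pi
print("belief after 100 steps:", pi)

# Steady state directly: the eigenvector of A^T with eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(A.T)
v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
print("steady state:", v / v.sum())
```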
The Markov model provides us with a very clean way of representing the state of the system and how it evolves. It's all packaged up nicely. The action model is a bit less elegant, although it is more generally applicable.
Under the Markov assumption, $x_t$ depends only on $x_{t-1}$. However, in the above conditional probability, $x_t$ seems to depend on $\pi_0$ and on the number of elapsed timesteps. Because we don't know what $x_{t-1}$ will be, we must marginalize over all possible values of $x_{t-1}$ to express $p(x_t)$ in a form which can be computed recursively and shows the Markov assumption more clearly:
$$
p(x_t \mid \pi_0) \;=\; \sum_{x_{t-1}} p(x_t \mid x_{t-1})\; p(x_{t-1} \mid \pi_0).
$$
For the rest of this section, I won't show the explicit dependence on $\pi_0$.
Oftentimes, the prior is also assumed implicitly:
$$
p(x_t) \;=\; \sum_{x_{t-1}} p(x_t \mid x_{t-1})\; p(x_{t-1}).
$$
I'll go with this notation because it's pretty clear that the recursion begins with $p(x_0)$. Finally, we can express the conditional in terms of the transition matrix:
$$
p(x_t = j) \;=\; \sum_{i} A_{ij}\; p(x_{t-1} = i).
$$
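Unrolling this recursion all the way back to the start makes the connection with the earlier closed-form expression explicit:
$$
p(x_t) \;=\; \sum_{x_{t-1}} p(x_t \mid x_{t-1})\; p(x_{t-1})
\;=\; \sum_{x_{t-1}} \cdots \sum_{x_0} \left[\prod_{k=1}^{t} p(x_k \mid x_{k-1})\right] p(x_0),
$$
which, stacked into a vector, is exactly the $(A^\top)^t\, \pi_0$ expression from before.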
Action Model - Transitioning between states without observation
In the action model, our state is the position of a robot in a grid. We are given a list of actions $u_{1:t}$ over the timesteps, along with probabilities describing how an action affects the state. This is known as the motion model, $p(x_t \mid x_{t-1}, u_t)$, similar to the transition matrix $A$ in the Hidden Markov Model. The robot can move up, down, left, or right. An action succeeds with probability $p$ and fails with probability $1 - p$. If the action would move the robot off the grid, it stays in place with absolute certainty (probability $1$).
I think this approach lends itself more to a code implementation, although I won't get into those details in this post.
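Still, here's a minimal sketch of what that motion model could look like in code. The 4x4 grid, the 0.8 success probability, and the choice that a failed action leaves the robot in place are all stand-ins I'm assuming for illustration:

```python
ROWS, COLS = 4, 4            # illustrative grid size
P_SUCCESS = 0.8              # illustrative stand-in for the success probability

# Row/column deltas for each action (row 0 is the top of the grid).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def motion_model(next_state, prev_state, action):
    """p(x_t = next_state | x_{t-1} = prev_state, u_t = action)."""
    r, c = prev_state
    dr, dc = MOVES[action]
    target = (r + dr, c + dc)
    if not (0 <= target[0] < ROWS and 0 <= target[1] < COLS):
        # The action would push the robot off the grid: it stays put with certainty.
        return 1.0 if next_state == prev_state else 0.0
    if next_state == target:
        return P_SUCCESS
    if next_state == prev_state:
        return 1.0 - P_SUCCESS   # assuming a failed action leaves the robot in place
    return 0.0
```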
If our set of actions is $u_{1:t} = \{u_1, \ldots, u_t\}$, then our probability of interest is the distribution over the state at time $t$ given those actions:
$$
p(x_t \mid u_{1:t}).
$$
Once again, we can assume the prior $p(x_0)$ implicitly. Similar to the HMM, the dependence on the entire sequence of actions reflects the fact that there are many possible previous states given our model. If we knew the previous state, we could describe the current state just in terms of the previous state and the current action, $p(x_t \mid x_{t-1}, u_t)$.
Since we don't know the previous state, we marginalize over all possible previous states:
$$
p(x_t \mid u_{1:t}) \;=\; \sum_{x_{t-1}} p(x_t \mid x_{t-1}, u_t)\; p(x_{t-1} \mid u_{1:t-1}).
$$
To align more with the HMM notation, we could say that $p(x_t)$ implicitly considers the current action (and the same for the previous state), and express the recursion as:
$$
p(x_t) \;=\; \sum_{x_{t-1}} p(x_t \mid x_{t-1})\; p(x_{t-1}).
$$
However, it's conventional to show that the distribution of the current state depends on the entire history of actions, so I will depart from the HMM notation from here on out. The idea is exactly the same.
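With a motion model like the sketch above, this marginalization over previous states becomes a direct double loop over grid cells:

```python
import numpy as np
from itertools import product

def predict(belief, action, motion_model, rows, cols):
    """Prediction step: sum over x_{t-1} of p(x_t | x_{t-1}, u_t) * bel(x_{t-1})."""
    new_belief = np.zeros_like(belief)
    for nxt in product(range(rows), range(cols)):
        new_belief[nxt] = sum(
            motion_model(nxt, prev, action) * belief[prev]
            for prev in product(range(rows), range(cols))
        )
    return new_belief

# Usage with the motion_model sketched earlier:
# belief = np.full((ROWS, COLS), 1.0 / (ROWS * COLS))
# belief = predict(belief, "up", motion_model, ROWS, COLS)
```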
Including observations in the HMM estimate
Now let's say we have a set of observations $z_{1:t}$ which provide us some information about the state: the probability of seeing a particular observation given the hidden state is summarized in an emission matrix $B$, where $B_{ij} = p(z_t = j \mid x_t = i)$ (this matrix doesn't have to be square; we can have more observation modalities than hidden states).
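For example, with our two hidden states and, say, three possible observation symbols, the emission matrix would be $2 \times 3$:
$$
B \;=\; \begin{bmatrix}
p(z = 1 \mid x = 1) & p(z = 2 \mid x = 1) & p(z = 3 \mid x = 1)\\
p(z = 1 \mid x = 2) & p(z = 2 \mid x = 2) & p(z = 3 \mid x = 2)
\end{bmatrix}.
$$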
Now our state estimation objective must include the sequence of observations:
$$
p(x_t \mid z_{1:t}).
$$
At this point, I think you get the idea: the notation is basically determined by clarity and preference, so I won't explain myself further.
If we knew the previous state, we could write $p(x_t \mid x_{t-1}, z_t)$. However, we must marginalize over all possible values of $x_{t-1}$. I will use Bayes' Rule in this derivation and the marginalization will appear:
$$
p(x_t \mid z_{1:t}) \;=\; \frac{p(z_t \mid x_t, z_{1:t-1})\; p(x_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}
\;=\; \frac{p(z_t \mid x_t)\; p(x_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})},
$$
where the second step uses the fact that the observation depends only on the current hidden state.
Now observe this:
$$
p(x_t \mid z_{1:t-1}) \;=\; \sum_{x_{t-1}} p(x_t \mid x_{t-1})\; p(x_{t-1} \mid z_{1:t-1}).
$$
Replacing this in the original expression:
$$
p(x_t \mid z_{1:t}) \;=\; \frac{p(z_t \mid x_t) \sum_{x_{t-1}} p(x_t \mid x_{t-1})\; p(x_{t-1} \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}.
$$
Defining a variable called $\mathrm{bel}(x_t) = p(x_t \mid z_{1:t})$ will help simplify this expression. I'll also substitute the transition matrix $A$ and the emission matrix $B$:
$$
\mathrm{bel}(x_t = j) \;=\; \frac{B_{j,\,z_t} \sum_{i} A_{ij}\; \mathrm{bel}(x_{t-1} = i)}{p(z_t \mid z_{1:t-1})}.
$$
Finally, since the denominator doesn't depend on the state, we can consider it a normalizing factor, $\eta = 1 / p(z_t \mid z_{1:t-1})$, to simplify. I'll come back to computing this later:
$$
\mathrm{bel}(x_t = j) \;=\; \eta\; B_{j,\,z_t} \sum_{i} A_{ij}\; \mathrm{bel}(x_{t-1} = i).
$$
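Putting the pieces together, one step of this filter is only a few lines of numpy; the values in A and B below are made-up placeholders just to make the snippet runnable:

```python
import numpy as np

A = np.array([[0.9, 0.1],        # A[i, j] = p(x_t = j | x_{t-1} = i)  (illustrative numbers)
              [0.3, 0.7]])
B = np.array([[0.8, 0.1, 0.1],   # B[i, k] = p(z_t = k | x_t = i)      (illustrative numbers)
              [0.1, 0.2, 0.7]])

def filter_step(belief, z):
    """One discrete Bayes filter step: predict with A, weight by p(z | x), normalize."""
    predicted = A.T @ belief              # marginalize over previous states
    unnormalized = B[:, z] * predicted    # multiply by the observation likelihood
    return unnormalized / unnormalized.sum()   # eta applied implicitly

belief = np.array([0.5, 0.5])
for z in [0, 0, 2]:                       # a made-up observation sequence
    belief = filter_step(belief, z)
print(belief)
```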
Including observations in the action model estimate
Now let's say we have a set of observations $z_{1:t}$ which provide us some information about the state. Maybe the points on the grid are painted either black or white, and these values are known ahead of time. The sensor is noisy, correctly reporting the color with an accuracy of 0.9.
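A small sketch of that sensor model, with a made-up color layout for the grid:

```python
import numpy as np

# Known color of each grid cell (0 = white, 1 = black); the layout here is made up.
COLOR_MAP = np.array([[0, 0, 1, 0],
                      [1, 0, 0, 1],
                      [0, 1, 0, 0],
                      [0, 0, 1, 1]])

def sensor_model(z, state):
    """p(z_t = z | x_t = state): the sensor reads the true color with probability 0.9."""
    return 0.9 if z == COLOR_MAP[state] else 0.1
```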
Our new state estimation objective:
$$
p(x_t \mid z_{1:t}, u_{1:t}).
$$
Similar to before, I can apply Bayes' Rule; the denominator will become the normalizing factor:
$$
p(x_t \mid z_{1:t}, u_{1:t}) \;=\; \frac{p(z_t \mid x_t)\; p(x_t \mid z_{1:t-1}, u_{1:t})}{p(z_t \mid z_{1:t-1}, u_{1:t})}.
$$
Substituting the marginalization over the previous state into the numerator:
$$
p(x_t \mid z_{1:t}, u_{1:t}) \;=\; \frac{p(z_t \mid x_t) \sum_{x_{t-1}} p(x_t \mid x_{t-1}, u_t)\; p(x_{t-1} \mid z_{1:t-1}, u_{1:t-1})}{p(z_t \mid z_{1:t-1}, u_{1:t})}.
$$
Since the denominator doesn't depend on the state, I'll let $\eta = 1 / p(z_t \mid z_{1:t-1}, u_{1:t})$ once again and substitute:
$$
p(x_t \mid z_{1:t}, u_{1:t}) \;=\; \eta\; p(z_t \mid x_t) \sum_{x_{t-1}} p(x_t \mid x_{t-1}, u_t)\; p(x_{t-1} \mid z_{1:t-1}, u_{1:t-1}).
$$
Since we don't have a clear matrix representation for the transition (or motion model) and emission (sensor model) probabilities here, I will simply create two functions for them and redefine $\mathrm{bel}$ in this context:
$$
\mathrm{motion}(x_t, x_{t-1}, u_t) = p(x_t \mid x_{t-1}, u_t), \qquad
\mathrm{sensor}(z_t, x_t) = p(z_t \mid x_t), \qquad
\mathrm{bel}(x_t) = p(x_t \mid z_{1:t}, u_{1:t}).
$$
Finally, I will substitute all these into the derived expression:
$$
\mathrm{bel}(x_t) \;=\; \eta\; \mathrm{sensor}(z_t, x_t) \sum_{x_{t-1}} \mathrm{motion}(x_t, x_{t-1}, u_t)\; \mathrm{bel}(x_{t-1}).
$$
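Combining the motion-model prediction with the sensor update gives the full filter step for the grid; this sketch reuses the motion_model and sensor_model functions from the earlier snippets:

```python
import numpy as np
from itertools import product

def bayes_filter_step(belief, action, z, motion_model, sensor_model, rows, cols):
    """One full update on the grid: predict with the motion model, weight by the sensor model, normalize."""
    new_belief = np.zeros_like(belief)
    for nxt in product(range(rows), range(cols)):
        predicted = sum(
            motion_model(nxt, prev, action) * belief[prev]
            for prev in product(range(rows), range(cols))
        )
        new_belief[nxt] = sensor_model(z, nxt) * predicted
    return new_belief / new_belief.sum()   # dividing by the sum is exactly the eta term
```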
Finally, I will get to deriving $\eta$. I'm only going to do it for the action model because they're basically the same, and I don't feel like doing the same thing twice!
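Expanding the denominator with the law of total probability over the current state gives:
$$
\eta^{-1} \;=\; p(z_t \mid z_{1:t-1}, u_{1:t})
\;=\; \sum_{x_t} p(z_t \mid x_t) \sum_{x_{t-1}} p(x_t \mid x_{t-1}, u_t)\; p(x_{t-1} \mid z_{1:t-1}, u_{1:t-1}).
$$
In other words, $\eta$ is just the constant that makes the updated belief sum to one, which is exactly how it's computed in practice.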
I will include everything for a single Bayes filter in one equation because it's satisfying to look at. This means I'll have to rename variables to avoid overlap.
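With $i$ for the previous state, $j$ for the current one, and $k$ for the summation index inside the normalizer:
$$
p(x_t = j \mid z_{1:t}, u_{1:t}) \;=\;
\frac{\mathrm{sensor}(z_t, j)\, \displaystyle\sum_{i} \mathrm{motion}(j, i, u_t)\; p(x_{t-1} = i \mid z_{1:t-1}, u_{1:t-1})}
{\displaystyle\sum_{k} \mathrm{sensor}(z_t, k)\, \sum_{i} \mathrm{motion}(k, i, u_t)\; p(x_{t-1} = i \mid z_{1:t-1}, u_{1:t-1})}.
$$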
This is nice. Next stop, Kalman Filters. After that, Extended and Unscented Kalman Filters.