The Kalman Filter

01 Apr, 2025

Finally...

We've gotten to the Kalman Filter. Up to this point, I've covered Markov models, hidden Markov models and the problems that can be solved with them (filtering, smoothing, decoding, etc.), Bayes filters, linear algebra, and the strategy for optimally combining estimators linearly. In this post, I will bring everything together with the Kalman Filter.

To put it concisely, the Kalman Filter is the optimal solution to the Bayes Filter for systems with linear dynamics and Gaussian Noise. What does this mean? Well let me review:

Markov Models, Hidden Markov Models, and the new Markov Decision Process (MDP)

An MDP describes a "stochastic dynamic system" -- a process where the state transitions from one value to the next depending on some control input, but with noise associated with each transition. We are unsure whether the stated control input will translate into the state update we are expecting. If our state transition dynamics are described by a function $f$ , the Markov decision process can be represented mathematically as:

x_{k + 1} = f (x_{k}, u_{k}) + ϵ_{k}

Since the transition is stochastic, not deterministic, the state transition can alternatively be represented as a probability distribution:

x_{k + 1} ~ P (x_{k + 1} | x_{k}, u_{k})

This situation is similar to the traditional Markov Model except that, for each control input, we have a separate transition matrix.

Lastly, the Kalman Filter models the special case of a linear stochastic system, which is formulated as follows:

x_{k + 1} = A x_{k} + B u_{k} + ϵ_{k}
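To make the linear stochastic system concrete, here is a minimal simulation sketch. The constant-velocity matrices `A` and `B`, the noise covariance, and the helper name `simulate_linear_system` are all assumptions chosen for illustration, not something from the earlier posts.

```python
import numpy as np

def simulate_linear_system(A, B, controls, x0, noise_cov, rng):
    """Roll out x_{k+1} = A x_k + B u_k + eps_k with eps_k ~ N(0, noise_cov)."""
    x = np.asarray(x0, dtype=float)
    states = [x]
    for u in controls:
        eps = rng.multivariate_normal(np.zeros(len(x)), noise_cov)
        x = A @ x + B @ u + eps
        states.append(x)
    return np.array(states)

# Hypothetical example: state = [position, velocity], control = acceleration.
dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
controls = [np.array([0.1])] * 5
rng = np.random.default_rng(0)
states = simulate_linear_system(A, B, controls, [0.0, 0.0], 0.01 * np.eye(2), rng)
print(states.shape)  # (6, 2): the initial state plus one state per control input
```

Running it twice with different seeds gives different trajectories for the same controls, which is exactly the "noise associated with the state transition" described above.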

Consider one more thing -- our observations in an HMM do not necessarily directly represent our state. It may be the case that our state is position and our GPS measures this position with some noise. However, the relation may be less direct. What is the analogue for a linear stochastic system?

Incorporating Gaussian Observations of a State Linearly

Imagine our sensor relates to the d-dimensional state through a linear function but it has some zero-mean Gaussian noise associated with it:

ℝ^{p} ∋ Y = C X + v

Notice that $Y$ lives in $ℝ^{p}$ while our state $X$ lives in $ℝ^{d}$ . Therefore, our $C$ matrix must have shape $p \times d$ . Additionally, we have $v ~ N (0, Q)$ , where $Q$ is the $p \times p$ covariance of the sensor noise.

It's similar to how, in an HMM, our emission matrix $M$ may not be square. We could have more observation dimensions than state dimensions, or fewer.
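A small sketch of the shapes involved may help. The numbers below are made up: a two-dimensional state (position and velocity) observed by a sensor that only sees position, so $p = 1$ and $d = 2$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 2, 1                      # state dimension, observation dimension
X = np.array([3.0, -0.5])        # hypothetical true state: [position, velocity]
C = np.array([[1.0, 0.0]])       # p x d: the sensor sees position only
Q = np.array([[0.25]])           # p x p sensor noise covariance

v = rng.multivariate_normal(np.zeros(p), Q)   # v ~ N(0, Q)
Y = C @ X + v                                 # noisy observation in R^p
print(C.shape, Y.shape)  # (1, 2) (1,)
```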

New Estimate: we combine the prior estimate ${\hat{X}}^{'}$ and the observation $Y$ linearly, and require the result to stay unbiased. Since $Y = C X + v$ with $E [v] = 0$ , and assuming the prior estimate is unbiased ( $E [{\hat{X}}^{'}] = E [X]$ ):

\begin{matrix} \hat{X} = K^{'} {\hat{X}}^{'} + K Y = K^{'} {\hat{X}}^{'} + K (C X + v) \\ E [\hat{X}] = K^{'} E [{\hat{X}}^{'}] + K C E [X] + K E [v] = K^{'} E [{\hat{X}}^{'}] + K C E [X] \\ E [{\hat{X}}^{'}] = E [X] \Rightarrow E [\hat{X}] = (K^{'} + K C) E [X] \\ K^{'} + K C = I \\ K^{'} = I - K C \end{matrix}

Substituting:

\begin{matrix} \hat{X} = (I - K C) {\hat{X}}^{'} + K Y \\ = {\hat{X}}^{'} - K C {\hat{X}}^{'} + K Y \\ = {\hat{X}}^{'} + K (Y - C {\hat{X}}^{'}) \end{matrix}

Covariance of the updated estimate, assuming the error in the prior estimate ${\hat{X}}^{'}$ is independent of the sensor noise $v$ :

\begin{matrix} Σ_{\hat{X}} = Cov ((I - K C) {\hat{X}}^{'} + K Y) \\ = (I - K C) Σ_{{\hat{X}}^{'}} (I - K C)^{T} + K Q K^{T} \end{matrix}

Now how can we minimize the trace of the covariance of the updated estimator with respect to $K$ ? My matrix calculus is not solid enough for this, so I'll have to take their word:

0 = \frac{\partial}{\partial K} tr (Σ_{\hat{X}})

0 = - 2 (I - K C) Σ_{{\hat{X}}^{'}} C^{T} + 2 K Q

\Rightarrow Σ_{{\hat{X}}^{'}} C^{T} = K (C Σ_{{\hat{X}}^{'}} C^{T} + Q)

\Rightarrow K = Σ_{{\hat{X}}^{'}} C^{T} (C Σ_{{\hat{X}}^{'}} C^{T} + Q)^{- 1} .
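Since the matrix calculus is being taken on faith, a numerical sanity check is reassuring. The sketch below uses made-up values for the prior covariance, $C$ and $Q$ ; the function names `kalman_gain` and `updated_cov` are my own. It checks that the gradient expression vanishes at the derived $K$ and that random perturbations of $K$ only increase the trace.

```python
import numpy as np

def kalman_gain(Sigma_prior, C, Q):
    """K = Sigma' C^T (C Sigma' C^T + Q)^{-1}."""
    S = C @ Sigma_prior @ C.T + Q
    return Sigma_prior @ C.T @ np.linalg.inv(S)

def updated_cov(K, Sigma_prior, C, Q):
    """(I - K C) Sigma' (I - K C)^T + K Q K^T, valid for any gain K."""
    I_KC = np.eye(Sigma_prior.shape[0]) - K @ C
    return I_KC @ Sigma_prior @ I_KC.T + K @ Q @ K.T

# Hypothetical numbers: 2-D state, 1-D observation of the first component.
Sigma_prior = np.array([[2.0, 0.3], [0.3, 1.0]])
C = np.array([[1.0, 0.0]])
Q = np.array([[0.5]])

K = kalman_gain(Sigma_prior, C, Q)

# The derivative of the trace, evaluated at the derived K, should be zero.
gradient = -2 * (np.eye(2) - K @ C) @ Sigma_prior @ C.T + 2 * K @ Q
best = np.trace(updated_cov(K, Sigma_prior, C, Q))

# Any perturbed gain should give a trace at least as large.
rng = np.random.default_rng(0)
worse = [
    np.trace(updated_cov(K + 0.1 * rng.standard_normal(K.shape), Sigma_prior, C, Q))
    for _ in range(100)
]
print(np.allclose(gradient, 0), best <= min(worse))
```

This is only a spot check on one example, not a proof, but it agrees with the closed-form result above.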

Now notice what happens if we set $C = I$ (the observation directly estimates state):

K = Σ_{{\hat{X}}^{'}} C^{T} (C Σ_{{\hat{X}}^{'}} C^{T} + Q)^{- 1} = Σ_{{\hat{X}}^{'}} (Σ_{{\hat{X}}^{'}} + Q)^{- 1}

Substituting back into our previous formula:

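\hat{X} = K^{'} {\hat{X}}^{'} + Σ_{{\hat{X}}^{'}} (Σ_{{\hat{X}}^{'}} + Q)^{- 1} Y = {\hat{X}}^{'} + Σ_{{\hat{X}}^{'}} (Σ_{{\hat{X}}^{'}} + Q)^{- 1} (Y - {\hat{X}}^{'})

Here $K^{'} = I - K = Q (Σ_{{\hat{X}}^{'}} + Q)^{- 1}$ , so each of the two estimates is weighted by the other one's covariance:

\hat{X} = Q (Σ_{{\hat{X}}^{'}} + Q)^{- 1} {\hat{X}}^{'} + Σ_{{\hat{X}}^{'}} (Σ_{{\hat{X}}^{'}} + Q)^{- 1} Y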
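As a quick worked example of the $C = I$ case, take a scalar state with made-up numbers: a prior estimate ${\hat{X}}^{'} = 10$ with variance $Σ_{{\hat{X}}^{'}} = 4$ , and an observation $Y = 12$ with noise variance $Q = 1$ . Plugging into the formulas for the gain, the estimate and the covariance:

\begin{matrix} K = \frac{4}{4 + 1} = 0.8 \\ \hat{X} = 10 + 0.8 (12 - 10) = 11.6 \\ Σ_{\hat{X}} = (1 - 0.8)^{2} \cdot 4 + 0.8^{2} \cdot 1 = 0.16 + 0.64 = 0.8 \end{matrix}

The new estimate lands closer to the observation because the observation is the less noisy of the two, and the resulting variance (0.8) is smaller than either the prior variance (4) or the sensor variance (1).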

Kalman Filter

Now that we know how to update an estimator using an observation that is a linear function of the hidden state, I can introduce the Kalman Filter.

The system can be described as follows:

\begin{matrix} x_{k + 1} = A x_{k} + B u_{k} + ϵ_{k} \\ y_{k} = C x_{k} + v_{k} \end{matrix}

with $ϵ_{k} ~ N (0, R)$ and $v_{k} ~ N (0, Q)$ .

The first equation describes how the state evolves over time. As you can see, the future state, before incorporating any observation, is a linear function of the current state and the control input, plus Gaussian noise. Since everything is linear and the noise is Gaussian, all state estimates remain Gaussian (assuming the initial state estimate is Gaussian as well). Therefore, they can be described completely using the mean and covariance. As a result, we only need to figure out what the $μ$ and $Σ$ are at each step.

The Kalman Filter proceeds in two distinct steps, similar to the Bayes Filter. First, the state is propagated independently of any observations using our model of the system's dynamics. Next, when we get an observation, we update the state estimate. There's some notation I want to introduce to make this a bit clearer:

{\hat{x}}_{k | k} ~ P (x_{k} | y_{1}, \dots, y_{k}) ~ N (μ_{k | k}, Σ_{k | k})

{\hat{x}}_{k + 1 | k} ~ P (x_{k + 1} | y_{1}, \dots, y_{k}) ~ N (μ_{k + 1 | k}, Σ_{k + 1 | k})

Therefore, given $x_{k | k}$ and $y_{k + 1}$ we want to get $x_{k + 1 | k + 1}$ . This requires us to first propagate $x_{k | k}$ to $x_{k + 1 | k}$ , and then factor in the observation $y_{k + 1}$ .

Propagation

First, we want to find $x_{k + 1 | k}$ , or $P (x_{k + 1} | y_{1}, \dots, y_{k})$ :

{\hat{x}}_{k + 1 | k} = A {\hat{x}}_{k | k} + B u_{k} + ϵ_{k}

\begin{matrix} μ_{k + 1 | k} = E [{\hat{x}}_{k + 1 | k}] = E [A {\hat{x}}_{k | k} + B u_{k} + ϵ_{k}] = A μ_{k | k} + B u_{k} \\ Σ_{k + 1 | k} = Cov ({\hat{x}}_{k + 1 | k}) = Cov (A {\hat{x}}_{k | k} + B u_{k} + ϵ_{k}) = A Σ_{k | k} A^{T} + R \end{matrix}
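The propagation step translates almost directly into code. This is a minimal sketch; the function name `predict` and the example matrices are assumptions for illustration. Note that $R$ is the process noise covariance here, following the notation in this post.

```python
import numpy as np

def predict(mu, Sigma, A, B, u, R):
    """Propagate N(mu, Sigma) one step through x' = A x + B u + eps, eps ~ N(0, R)."""
    mu_pred = A @ mu + B @ u            # mean: noise is zero-mean, so it drops out
    Sigma_pred = A @ Sigma @ A.T + R    # covariance: transformed prior plus process noise
    return mu_pred, Sigma_pred

# Hypothetical constant-velocity model with unit time step and no control input.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.5], [1.0]])
R = 0.1 * np.eye(2)
mu, Sigma = np.array([0.0, 1.0]), np.eye(2)

mu_pred, Sigma_pred = predict(mu, Sigma, A, B, np.array([0.0]), R)
print(mu_pred)     # [1. 1.]
print(Sigma_pred)  # entries 2.1, 1.0 / 1.0, 1.1
```

Notice that the covariance grows during propagation: without an observation, each step adds uncertainty.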

Observation

Now we have $x_{k + 1 | k}$ but need to update with our most recent observation.

y_{k + 1} = C x_{k + 1} + v_{k + 1}

So recall our formula from before:

x_{k + 1 | k + 1} = x_{k + 1 | k} + K_{k + 1} (y_{k + 1} - C x_{k + 1 | k})

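Now we need to update the mean and the covariance, using the gain $K_{k + 1} = Σ_{k + 1 | k} C^{T} (C Σ_{k + 1 | k} C^{T} + Q)^{- 1}$ from the formula derived earlier:

\begin{matrix} μ_{k + 1 | k + 1} = μ_{k + 1 | k} + K_{k + 1} (y_{k + 1} - C μ_{k + 1 | k}) \\ Σ_{k + 1 | k + 1} = (I - K_{k + 1} C) Σ_{k + 1 | k} (I - K_{k + 1} C)^{T} + K_{k + 1} Q K_{k + 1}^{T} \end{matrix}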

Or, when $K_{k + 1}$ is the optimal gain, we can use the condensed form for $Σ$ :

Σ_{k + 1 | k + 1} = (Σ_{k + 1 | k}^{- 1} + C^{T} Q^{- 1} C)^{- 1}
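Here is a sketch of the observation step, again with hypothetical names and numbers. It computes the covariance with the long form and then checks that the condensed form gives the same matrix when the optimal gain is used.

```python
import numpy as np

def update(mu_pred, Sigma_pred, y, C, Q):
    """Fold observation y = C x + v, v ~ N(0, Q), into the predicted N(mu_pred, Sigma_pred)."""
    S = C @ Sigma_pred @ C.T + Q
    K = Sigma_pred @ C.T @ np.linalg.inv(S)          # optimal gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)         # correct by the scaled residual
    I_KC = np.eye(len(mu_pred)) - K @ C
    Sigma_new = I_KC @ Sigma_pred @ I_KC.T + K @ Q @ K.T
    return mu_new, Sigma_new, K

# Hypothetical predicted estimate and a position-only sensor.
mu_pred = np.array([1.0, 1.0])
Sigma_pred = np.array([[2.1, 1.0], [1.0, 1.1]])
C = np.array([[1.0, 0.0]])
Q = np.array([[0.5]])
y = np.array([1.4])

mu_new, Sigma_new, K = update(mu_pred, Sigma_pred, y, C, Q)

# Condensed form: (Sigma_pred^{-1} + C^T Q^{-1} C)^{-1}
condensed = np.linalg.inv(np.linalg.inv(Sigma_pred) + C.T @ np.linalg.inv(Q) @ C)
print(np.allclose(Sigma_new, condensed))  # True
```

The long form has the advantage of holding for any gain, while the condensed form is only valid for the optimal one. Explicit matrix inverses are used here for readability; a production implementation would more likely use a linear solve.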

Notice that we really have two different formulations of the Kalman Filter update step:

New Estimate = K^{'} X + K Y = (I - K C) X + K Y

and

New Estimate = X + K (Y - C X)

I generally favor the clarity of the second. It shows that we have some noisy estimate from our dynamics model $X$ and some noisy estimate from our sensor model $Y$ , and we essentially want to update the estimate with this noisy observation. The way we do that is by taking the difference of our observation from what our dynamics model would have expected the observation to be (basically the new information added by our observation) and then scaling it by some factor related to both the covariance of the estimate and the observation. If we have a noisier observation, $K$ will be smaller.
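The claim that a noisier observation shrinks $K$ is easy to see numerically. A minimal scalar sketch, with made-up values for the predicted variance and the sensor noise:

```python
import numpy as np

Sigma_pred = np.array([[1.0]])   # hypothetical predicted variance
C = np.array([[1.0]])            # the sensor observes the state directly

gains = []
for q in (0.01, 1.0, 100.0):     # increasingly noisy sensor
    Q = np.array([[q]])
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Q)
    gains.append(K[0, 0])

print(np.round(gains, 3))  # roughly 0.99, 0.5, 0.01
```

With a very precise sensor the gain is close to 1 and the update nearly replaces the prediction with the observation. With a very noisy sensor the gain is close to 0 and the observation is mostly ignored.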

By using the form $X + K (Y - C X)$ or, equivalently, $(I - K C) X + K Y$ , we enforce that our estimated value is unbiased. Then, by minimizing the trace of the covariance of the new estimator with respect to $K$ , we find the optimal formulation for combining the estimate and observation.
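Putting propagation and observation together gives the full filter loop. The sketch below tracks a simulated constant-velocity target from noisy position readings. The model matrices, noise levels, initial guess, and the name `kalman_step` are all assumptions for illustration.

```python
import numpy as np

def kalman_step(mu, Sigma, u, y, A, B, C, R, Q):
    """One propagation step followed by one observation step."""
    # Propagation: push the estimate through the dynamics model.
    mu_pred = A @ mu + B @ u
    Sigma_pred = A @ Sigma @ A.T + R
    # Observation: correct the prediction with the new measurement.
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Q)
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    I_KC = np.eye(len(mu)) - K @ C
    Sigma_new = I_KC @ Sigma_pred @ I_KC.T + K @ Q @ K.T
    return mu_new, Sigma_new

rng = np.random.default_rng(42)
A = np.array([[1.0, 1.0], [0.0, 1.0]])   # state = [position, velocity]
B = np.zeros((2, 1))                     # no control input in this example
C = np.array([[1.0, 0.0]])               # sensor sees position only
R = 0.01 * np.eye(2)                     # process noise covariance
Q = np.array([[4.0]])                    # sensor noise covariance

x = np.array([0.0, 1.0])                 # true (hidden) state
mu, Sigma = np.zeros(2), 10.0 * np.eye(2)  # vague initial estimate
u = np.zeros(1)

raw_err, filt_err = [], []
for _ in range(200):
    x = A @ x + B @ u + rng.multivariate_normal(np.zeros(2), R)
    y = C @ x + rng.multivariate_normal(np.zeros(1), Q)
    mu, Sigma = kalman_step(mu, Sigma, u, y, A, B, C, R, Q)
    raw_err.append((y[0] - x[0]) ** 2)     # squared error of the raw reading
    filt_err.append((mu[0] - x[0]) ** 2)   # squared error of the filtered estimate

print(np.mean(raw_err), np.mean(filt_err))
```

The filtered position error should come out noticeably smaller than the raw sensor error, because the filter pools information from the dynamics model and all past observations rather than trusting each reading on its own.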

I hope this is relatively intuitive by this point. Although the matrix calculus is a bit over my head, I think the previous examples give sufficient understanding of why the Kalman Filter is the optimal linear estimator for a linear system with Gaussian noise.