
Expectation, Covariance, and Optimal Estimators

Expected Value and Variance

A random variable $X$ can take a range of values, and the distribution of those outcomes is determined by the underlying pdf $f(x)$. The distribution has a mean, also known as the expected value. The mean and expected value are denoted by $\mu_X$ and $E[X]$, respectively.

We can calculate the expected value as a weighted average of the outcomes of the random variable, weighted by their probability density:

$$E[X] = \int_{-\infty}^{\infty} x\, f(x)\, dx$$

Now we also want a measure of the spread -- how much the outcomes typically deviate from the expected value.

$$\sigma_X^2 = \int_{-\infty}^{\infty} (x - E[X])^2 f(x)\, dx$$

$\sigma_X$ is the standard deviation and $\sigma_X^2$ is the variance.
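
To make these two definitions concrete, here's a minimal numerical sketch (my own example, not part of the original derivation) that approximates both integrals by sampling. The Gaussian with $\mu = 2$, $\sigma = 0.5$ is an arbitrary assumed choice:

```python
# Monte Carlo estimates of E[X] and sigma^2 for an assumed Gaussian with
# mu = 2.0, sigma = 0.5 (so the true values are 2.0 and 0.25).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5
x = rng.normal(mu, sigma, size=1_000_000)

mean_est = x.mean()                        # approximates E[X] = int x f(x) dx
var_est = np.mean((x - mean_est) ** 2)     # approximates int (x - E[X])^2 f(x) dx

print(mean_est, var_est)                   # ~2.0 and ~0.25
```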

Covariance for $X \in \mathbb{R}^d$

Now what if we have a multivariate random variable? Say, for instance, the position of a robot on a 2D grid: $X = \begin{bmatrix} x \\ y \end{bmatrix}$.

The expected value for the vector can be found simply by taking the expected value element-wise:

$$E[X] = \begin{bmatrix} E[x] \\ E[y] \end{bmatrix} = \begin{bmatrix} \int_{-\infty}^{\infty} x\, f(x)\, dx \\ \int_{-\infty}^{\infty} y\, f(y)\, dy \end{bmatrix}$$

Variance is actually trickier. The features are related in some way (that's why they're in a single vector), so taking the variance element-wise would leave valuable information out. We can categorize the relation between the two features into two groups: either $x$ and $y$ tend to be big together and small together, or when one is big the other tends to be small.

But what do we mean by big or small? Well, the difference between the random variable and its expected value is a good measure! It's small if $X - E[X] < 0$ and big if $X - E[X] > 0$.

Note that it's not possible to have a configuration where, say, $y - E[y] > 0$ for every outcome. If this were the case, it'd mean that $y$ is always big! That can't happen, because the mean would then be pulled up until some $y$ values ended up below it (i.e., small) after all.

The relation between $x$ and $y$ is really what defines the covariance. It tells us a lot about the "shape" of the distribution. But how do we calculate it? Well, we only really care about whether or not the variables change together. It doesn't matter if they're both small or both big -- that's the same group. We want a positive covariance if they're directly related and a negative one if they're inversely related. With that in mind, the formula should feel quite intuitive:

$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$

If both values are small (negative times negative) or both are big (positive times positive), the product inside the expectation is positive. If one is big and the other is small, it's negative. The expected value then gives us the average of this product over all outcomes.

A nice property of covariance is that taking the covariance of a random variable with itself is simply the variance.

$$\mathrm{Cov}(X, X) = E[(X - E[X])(X - E[X])] = E[(X - E[X])^2] = \int_{-\infty}^{\infty} (x - E[X])^2 f(x)\, dx = \sigma_X^2$$
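
If you want to check these formulas empirically, here's a small sketch (my own example; the relation $y = 2x + \text{noise}$ is an arbitrary assumption chosen to make the covariance positive):

```python
# Sample-based check: Cov(X, Y) is positive when x and y move together, and
# Cov(X, X) is just Var(X).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=500_000)
y = 2.0 * x + rng.normal(0.0, 0.5, size=500_000)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # E[(X - E[X])(Y - E[Y])]
cov_xx = np.mean((x - x.mean()) ** 2)               # Cov(X, X)

print(cov_xy)              # ~2.0 and positive: x and y are directly related
print(cov_xx, x.var())     # both ~1.0: Cov(X, X) equals Var(X)
```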

So for our multidimensional random variable, the covariance matrix would look like this:

$$\Sigma_X = \begin{bmatrix} \mathrm{Cov}(x, x) & \mathrm{Cov}(x, y) \\ \mathrm{Cov}(y, x) & \mathrm{Cov}(y, y) \end{bmatrix} = \begin{bmatrix} \mathrm{Var}(x) & \mathrm{Cov}(x, y) \\ \mathrm{Cov}(y, x) & \mathrm{Var}(y) \end{bmatrix}$$

And for a random vector with $d$ dimensions:

$$\Sigma_X = \begin{bmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_d) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_d) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_d, X_1) & \mathrm{Cov}(X_d, X_2) & \cdots & \mathrm{Var}(X_d) \end{bmatrix}$$

This can actually be calculated using a matrix product as follows:

$$\Sigma_X = \begin{bmatrix} E[(X_1 - E[X_1])(X_1 - E[X_1])] & E[(X_1 - E[X_1])(X_2 - E[X_2])] & \cdots & E[(X_1 - E[X_1])(X_d - E[X_d])] \\ E[(X_2 - E[X_2])(X_1 - E[X_1])] & E[(X_2 - E[X_2])(X_2 - E[X_2])] & \cdots & E[(X_2 - E[X_2])(X_d - E[X_d])] \\ \vdots & \vdots & \ddots & \vdots \\ E[(X_d - E[X_d])(X_1 - E[X_1])] & E[(X_d - E[X_d])(X_2 - E[X_2])] & \cdots & E[(X_d - E[X_d])(X_d - E[X_d])] \end{bmatrix}$$

$$X_c = \begin{bmatrix} X_1 - E[X_1] \\ X_2 - E[X_2] \\ \vdots \\ X_d - E[X_d] \end{bmatrix}, \qquad X_c^T = \begin{bmatrix} X_1 - E[X_1] & X_2 - E[X_2] & \cdots & X_d - E[X_d] \end{bmatrix}$$

$$\Sigma_X = E[X_c X_c^T]$$

Feel free to verify this product yourself.
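
As a sanity check, here's a short sketch (my addition, with an arbitrary 2D Gaussian) that builds $\Sigma_X$ from the outer product $E[X_c X_c^T]$, using sample means in place of true expectations, and compares it with numpy's own covariance:

```python
# Build Sigma_X as E[X_c X_c^T] from centered samples and compare against np.cov.
import numpy as np

rng = np.random.default_rng(2)
samples = rng.multivariate_normal(mean=[1.0, -1.0],
                                  cov=[[2.0, 0.8], [0.8, 1.0]],
                                  size=200_000)              # shape (n, 2)

Xc = samples - samples.mean(axis=0)                          # X - E[X], row-wise
sigma_outer = (Xc[:, :, None] * Xc[:, None, :]).mean(axis=0) # E[X_c X_c^T]

print(sigma_outer)
print(np.cov(samples.T, bias=True))                          # should match closely
```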

Properties of Expectation and Covariance

Linearity

We want to show:

$$E[aX + bY] = aE[X] + bE[Y]$$

First, note that, using marginalization, $\int_{-\infty}^{\infty} f(x, y)\, dy = f(x)$ (and similarly $\int_{-\infty}^{\infty} f(x, y)\, dx = f(y)$). With that, we can proceed with the derivation:

$$
\begin{aligned}
E[aX + bY] &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (ax + by)\, f(x, y)\, dx\, dy \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} ax\, f(x, y)\, dx\, dy + \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} by\, f(x, y)\, dx\, dy \\
&= a\int_{-\infty}^{\infty} x\, f(x)\, dx + b\int_{-\infty}^{\infty} y\, f(y)\, dy \\
&= aE[X] + bE[Y]
\end{aligned}
$$
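
A quick numerical confirmation of linearity (my own example; $y$ is deliberately made dependent on $x$, since linearity doesn't require independence):

```python
# Linearity of expectation on samples: E[aX + bY] = a E[X] + b E[Y].
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(2.0, size=1_000_000)
y = 0.5 * x + rng.normal(0.0, 1.0, size=1_000_000)
a, b = 3.0, -1.5

print(np.mean(a * x + b * y))            # E[aX + bY]
print(a * np.mean(x) + b * np.mean(y))   # a E[X] + b E[Y], same value
```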

Cov(X + Y)

$$\mathrm{Cov}(X + Y) = \mathrm{Cov}(X) + \mathrm{Cov}(X, Y) + \mathrm{Cov}(Y, X) + \mathrm{Cov}(Y)$$

$$
\begin{aligned}
\mathrm{Cov}(X + Y) &= E[(X + Y - E[X + Y])(X + Y - E[X + Y])^T] \\
&= E[(X + Y - E[X] - E[Y])(X + Y - E[X] - E[Y])^T] \\
&= E[(X - E[X])(X - E[X])^T + (X - E[X])(Y - E[Y])^T + (Y - E[Y])(X - E[X])^T + (Y - E[Y])(Y - E[Y])^T] \\
&= E[(X - E[X])(X - E[X])^T] + E[(X - E[X])(Y - E[Y])^T] + E[(Y - E[Y])(X - E[X])^T] + E[(Y - E[Y])(Y - E[Y])^T] \\
&= \mathrm{Cov}(X) + \mathrm{Cov}(X, Y) + \mathrm{Cov}(Y, X) + \mathrm{Cov}(Y)
\end{aligned}
$$

In the scalar case, $\mathrm{Cov}(Y, X) = \mathrm{Cov}(X, Y)$, so this reduces to $\mathrm{Var}(X) + 2\,\mathrm{Cov}(X, Y) + \mathrm{Var}(Y)$.
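
And a sample-based check of the scalar version of this identity (my own example, with arbitrarily chosen noise levels):

```python
# Scalar check: Var(X + Y) = Var(X) + 2 Cov(X, Y) + Var(Y), with x and y
# correlated on purpose.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=1_000_000)
y = -0.7 * x + rng.normal(0.0, 2.0, size=1_000_000)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(np.var(x + y))                          # left-hand side
print(np.var(x) + 2.0 * cov_xy + np.var(y))   # right-hand side, same value
```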

Cov(AX)

$$
\begin{aligned}
\mathrm{Cov}(AX) &= E[(AX - E[AX])(AX - E[AX])^T] \\
&= E[(AX - AE[X])(AX - AE[X])^T] \\
&= E[A(X - E[X])(X - E[X])^T A^T] \\
&= A\, E[(X - E[X])(X - E[X])^T]\, A^T \\
&= A\, \mathrm{Cov}(X)\, A^T
\end{aligned}
$$

Note that, when I say $\mathrm{Cov}(X)$ for a random vector $X$, I mean the covariance matrix: the covariance of each feature of the random variable with respect to the other features. It's the same object as the $\Sigma_X$ I explored earlier with $X = \begin{bmatrix} x \\ y \end{bmatrix}$.
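
Here's a short check of $\mathrm{Cov}(AX) = A\,\mathrm{Cov}(X)\,A^T$ (my own sketch; the matrix $A$ and the covariance of $X$ are arbitrary choices):

```python
# Check Cov(AX) = A Cov(X) A^T with an arbitrary 2x2 matrix A.
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, 0.3], [0.3, 2.0]],
                            size=300_000)                # shape (n, 2)
A = np.array([[1.0, 2.0],
              [0.5, -1.0]])

cov_AX = np.cov((X @ A.T).T, bias=True)                  # Cov(AX) from transformed samples
print(cov_AX)
print(A @ np.cov(X.T, bias=True) @ A.T)                  # A Cov(X) A^T, should match
```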

Estimation

Imagine $X \in \mathbb{R}^d$ is a hidden state which we can't observe. Since it's unobservable, we want an estimator $\hat{X}$. The estimator may be noisy (ideally not), but it definitely shouldn't have bias. An unbiased estimator is one which, in the long run, averages out to the value of the hidden state. In other words, despite the noise, the estimator is centered around the hidden state's value.

To put it mathematically, we want $E[\hat{X}] = X$.

The error of our estimator is $\tilde{X} = \hat{X} - X$. The expected value of the error should be zero, because the bias is zero and, on average, the noise falls evenly on either side of the mean:

$$E[\tilde{X}] = E[\hat{X} - X] = E[\hat{X}] - E[X] = X - X = 0$$

The covariance of the estimator is $\Sigma_{\hat{X}}$.
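
A tiny illustration of these definitions (my own example; the hidden state and noise covariance are made up for the demo, and in practice $X$ is of course unknown):

```python
# An unbiased estimator of a fixed hidden state: the state plus zero-mean
# noise. The mean error is ~0 and the error covariance is Sigma_X_hat.
import numpy as np

rng = np.random.default_rng(6)
X = np.array([1.0, -2.0])                                # hidden state (made up)
noise_cov = [[0.5, 0.1], [0.1, 0.3]]
X_hat = X + rng.multivariate_normal([0.0, 0.0], noise_cov, size=100_000)

error = X_hat - X                                        # X_tilde = X_hat - X
print(error.mean(axis=0))                                # ~[0, 0]: unbiased
print(np.cov(error.T, bias=True))                        # ~noise_cov, i.e. Sigma_X_hat
```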

Combining Estimators

Oftentimes we have two separate estimators of a single hidden state: $\hat{X}_1, \hat{X}_2$. How can we combine them to get an estimator that is better than either one on its own? Well, if we're looking for an optimal combination, we need some measure of optimality to optimize for. The trace of a matrix is the sum of its diagonal entries, and in a covariance matrix the diagonal entries are the variances of each feature of the random variable. Since we care more about the spread of each feature than about the covariances between features, the trace seems like a good measure of the overall noise level.

With that measure of optimality, we can formalize our task:

$$\hat{X} = f(\hat{X}_1, \hat{X}_2) \quad \text{s.t.} \quad \mathrm{Tr}(\Sigma_{\hat{X}}) \text{ is minimized and } E[\hat{X}] = X$$

So, we ensure the estimator is unbiased and then minimize the trace of its covariance.

Consider the example of a linear combination of two 1D Gaussians:

$$\hat{X} = f(\hat{X}_1, \hat{X}_2) = k_1 \hat{X}_1 + k_2 \hat{X}_2$$

We want it to be unbiased:

$$E[\hat{X}] = E[k_1 \hat{X}_1 + k_2 \hat{X}_2] = k_1 E[\hat{X}_1] + k_2 E[\hat{X}_2] = k_1 X + k_2 X$$

For this to equal $X$, we need $k_1 + k_2 = 1$, i.e. $k_2 = 1 - k_1$.

We assume the two estimators are independent, so their covariance is 0. For independent RVs, the following identity applies: $\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$.

$$
\begin{aligned}
\mathrm{Var}(\hat{X}) &= \mathrm{Var}(k_1 \hat{X}_1 + k_2 \hat{X}_2) = k_1^2 \sigma_1^2 + k_2^2 \sigma_2^2 \\
&= k_1^2 \sigma_1^2 + (1 - k_1)^2 \sigma_2^2 \\
&= k_1^2 \sigma_1^2 + (1 - 2k_1 + k_1^2)\sigma_2^2 \\
&= k_1^2 \sigma_1^2 + \sigma_2^2 - 2 k_1 \sigma_2^2 + k_1^2 \sigma_2^2
\end{aligned}
$$

Now we want to minimize the variance w.r.t. $k_1$:

$$
\begin{aligned}
\frac{\partial}{\partial k_1}\left(k_1^2 \sigma_1^2 + \sigma_2^2 - 2 k_1 \sigma_2^2 + k_1^2 \sigma_2^2\right) &= 0 \\
2 k_1 \sigma_1^2 - 2 \sigma_2^2 + 2 k_1 \sigma_2^2 &= 0 \\
k_1 \sigma_1^2 - \sigma_2^2 + k_1 \sigma_2^2 &= 0 \\
k_1 &= \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}
\end{aligned}
$$
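
If you'd rather not do the calculus by hand, a symbolic check (my addition, using sympy) recovers the same critical point:

```python
# Symbolic version of the same minimization: solve d(Var)/dk1 = 0.
import sympy as sp

k1, s1, s2 = sp.symbols("k1 sigma_1 sigma_2", positive=True)
var = k1**2 * s1**2 + (1 - k1)**2 * s2**2     # Var(k1 X1_hat + (1 - k1) X2_hat)

print(sp.solve(sp.diff(var, k1), k1))         # [sigma_2**2/(sigma_1**2 + sigma_2**2)]
```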

Solving for $k_2$:

$$k_2 = 1 - \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} = \frac{\sigma_1^2 + \sigma_2^2}{\sigma_1^2 + \sigma_2^2} - \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} = \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}$$

Replacing $k_1, k_2$ with the critical values we calculated:

$$\hat{X} = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}\hat{X}_1 + \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}\hat{X}_2$$

This should intuitively make sense. If $\hat{X}_1$ has a lot of noise ($\sigma_1$ is big), $\hat{X}_2$ will contribute more to the estimator.
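
Here's a small simulation of this 1D fusion (my own sketch; the hidden state and the two noise levels are arbitrary assumptions):

```python
# Fuse two independent noisy 1D estimators of a made-up hidden state X = 5.0
# with the optimal weights; the fused variance drops to s1^2 s2^2 / (s1^2 + s2^2).
import numpy as np

rng = np.random.default_rng(7)
X = 5.0
s1, s2 = 2.0, 1.0                              # assumed noise std-devs

x1 = X + rng.normal(0.0, s1, size=1_000_000)   # samples of estimator 1
x2 = X + rng.normal(0.0, s2, size=1_000_000)   # samples of estimator 2

k1 = s2**2 / (s1**2 + s2**2)
k2 = s1**2 / (s1**2 + s2**2)
fused = k1 * x1 + k2 * x2

print(fused.mean())                                   # ~5.0: still unbiased
print(fused.var(), s1**2 * s2**2 / (s1**2 + s2**2))   # ~0.8, below both 4.0 and 1.0
```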

Combining Estimators for Multidimensional RVs

Now we have $X, \hat{X}_1, \hat{X}_2 \in \mathbb{R}^d$ and $K_1, K_2 \in \mathbb{R}^{d \times d}$:

$$\hat{X} = K_1 \hat{X}_1 + K_2 \hat{X}_2$$

First, we must ensure the estimator is unbiased:

$$E[\hat{X}] = E[K_1 \hat{X}_1 + K_2 \hat{X}_2] = K_1 E[\hat{X}_1] + K_2 E[\hat{X}_2] = K_1 X + K_2 X = (K_1 + K_2)X$$

For this to equal $X$, we need $K_1 + K_2 = I$, i.e. $K_2 = I - K_1$.

This condition ensures the new estimator is unbiased. On to minimizing the covariance:

$$
\begin{aligned}
\Sigma_{\hat{X}} = \mathrm{Cov}(\hat{X}) &= \mathrm{Cov}(K_1 \hat{X}_1 + K_2 \hat{X}_2) \\
&= \mathrm{Cov}(K_1 \hat{X}_1) + \mathrm{Cov}(K_2 \hat{X}_2) \\
&= K_1 \Sigma_{\hat{X}_1} K_1^T + K_2 \Sigma_{\hat{X}_2} K_2^T \\
&= K_1 \Sigma_{\hat{X}_1} K_1^T + (I - K_1) \Sigma_{\hat{X}_2} (I - K_1)^T \\
&= K_1 \Sigma_{\hat{X}_1} K_1^T + K_1 \Sigma_{\hat{X}_2} K_1^T - K_1 \Sigma_{\hat{X}_2} - \Sigma_{\hat{X}_2} K_1^T + \Sigma_{\hat{X}_2}
\end{aligned}
$$

$$
\begin{aligned}
\mathrm{Tr}(\Sigma_{\hat{X}}) &= \mathrm{Tr}(K_1 \Sigma_{\hat{X}_1} K_1^T) + \mathrm{Tr}(K_1 \Sigma_{\hat{X}_2} K_1^T) - \mathrm{Tr}(K_1 \Sigma_{\hat{X}_2}) - \mathrm{Tr}(\Sigma_{\hat{X}_2} K_1^T) + \mathrm{Tr}(\Sigma_{\hat{X}_2}) \\
&= \mathrm{Tr}(K_1 \Sigma_{\hat{X}_1} K_1^T) + \mathrm{Tr}(K_1 \Sigma_{\hat{X}_2} K_1^T) - \mathrm{Tr}(K_1 \Sigma_{\hat{X}_2}) - \mathrm{Tr}(K_1 \Sigma_{\hat{X}_2}) + \mathrm{Tr}(\Sigma_{\hat{X}_2}) \\
&= \mathrm{Tr}(K_1 \Sigma_{\hat{X}_1} K_1^T) + \mathrm{Tr}(K_1 \Sigma_{\hat{X}_2} K_1^T) - 2\,\mathrm{Tr}(K_1 \Sigma_{\hat{X}_2}) + \mathrm{Tr}(\Sigma_{\hat{X}_2})
\end{aligned}
$$

(Here we used $\mathrm{Tr}(A) = \mathrm{Tr}(A^T)$ and the symmetry of $\Sigma_{\hat{X}_2}$ to write $\mathrm{Tr}(\Sigma_{\hat{X}_2} K_1^T) = \mathrm{Tr}(K_1 \Sigma_{\hat{X}_2})$.)

Now taking the partial derivative w.r.t. $K_1$:

$$
\begin{aligned}
\frac{\partial}{\partial K_1}\Big(\mathrm{Tr}(K_1 \Sigma_{\hat{X}_1} K_1^T) + \mathrm{Tr}(K_1 \Sigma_{\hat{X}_2} K_1^T) - 2\,\mathrm{Tr}(K_1 \Sigma_{\hat{X}_2}) + \mathrm{Tr}(\Sigma_{\hat{X}_2})\Big) &= 0 \\
2 K_1 \Sigma_{\hat{X}_1} + 2 K_1 \Sigma_{\hat{X}_2} - 2 \Sigma_{\hat{X}_2} &= 0 \\
K_1 \Sigma_{\hat{X}_1} + K_1 \Sigma_{\hat{X}_2} - \Sigma_{\hat{X}_2} &= 0 \\
K_1 (\Sigma_{\hat{X}_1} + \Sigma_{\hat{X}_2}) &= \Sigma_{\hat{X}_2} \\
K_1 &= \Sigma_{\hat{X}_2} (\Sigma_{\hat{X}_1} + \Sigma_{\hat{X}_2})^{-1}
\end{aligned}
$$

(The constant term $\mathrm{Tr}(\Sigma_{\hat{X}_2})$ has zero derivative, and we used $\frac{\partial}{\partial K}\mathrm{Tr}(K B K^T) = 2KB$ for symmetric $B$, along with $\frac{\partial}{\partial K}\mathrm{Tr}(K \Sigma_{\hat{X}_2}) = \Sigma_{\hat{X}_2}^T = \Sigma_{\hat{X}_2}$.)

Solving for $K_2$:

$$
\begin{aligned}
K_2 &= I - K_1 = I - \Sigma_{\hat{X}_2}(\Sigma_{\hat{X}_1} + \Sigma_{\hat{X}_2})^{-1} \\
&= (\Sigma_{\hat{X}_1} + \Sigma_{\hat{X}_2})(\Sigma_{\hat{X}_1} + \Sigma_{\hat{X}_2})^{-1} - \Sigma_{\hat{X}_2}(\Sigma_{\hat{X}_1} + \Sigma_{\hat{X}_2})^{-1} \\
&= (\Sigma_{\hat{X}_1} + \Sigma_{\hat{X}_2} - \Sigma_{\hat{X}_2})(\Sigma_{\hat{X}_1} + \Sigma_{\hat{X}_2})^{-1} \\
&= \Sigma_{\hat{X}_1}(\Sigma_{\hat{X}_1} + \Sigma_{\hat{X}_2})^{-1}
\end{aligned}
$$

Rewriting our original expression for the estimator:

$$\hat{X} = \Sigma_{\hat{X}_2}(\Sigma_{\hat{X}_1} + \Sigma_{\hat{X}_2})^{-1}\hat{X}_1 + \Sigma_{\hat{X}_1}(\Sigma_{\hat{X}_1} + \Sigma_{\hat{X}_2})^{-1}\hat{X}_2$$

This looks really similar to the 1D case and still aligns with intuition: if $\hat{X}_1$ has a large covariance, the weight $K_2$ on $\hat{X}_2$ will be large, so $\hat{X}_2$ will dominate the new estimator.
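
A quick simulation of the multidimensional fusion (my own sketch; the hidden state and the two covariance matrices are arbitrary assumptions) shows the fused covariance has a smaller trace than either input:

```python
# Fuse two independent multidimensional estimators with K1 = S2 (S1 + S2)^-1
# and K2 = S1 (S1 + S2)^-1; the fused covariance has the smallest trace.
import numpy as np

rng = np.random.default_rng(8)
X = np.array([1.0, 2.0])                                  # hidden state (made up)
S1 = np.array([[3.0, 0.5], [0.5, 1.0]])                   # Sigma of estimator 1 (assumed)
S2 = np.array([[1.0, -0.2], [-0.2, 2.0]])                 # Sigma of estimator 2 (assumed)

n = 500_000
X1 = X + rng.multivariate_normal([0.0, 0.0], S1, size=n)
X2 = X + rng.multivariate_normal([0.0, 0.0], S2, size=n)

K1 = S2 @ np.linalg.inv(S1 + S2)
K2 = S1 @ np.linalg.inv(S1 + S2)
fused = X1 @ K1.T + X2 @ K2.T                             # row-wise K1 x1 + K2 x2

print(fused.mean(axis=0))                                 # ~[1, 2]: unbiased
print(np.trace(np.cov(fused.T, bias=True)),               # smallest of the three traces
      np.trace(S1), np.trace(S2))
```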

This choice of $K_1$ and $K_2$ is actually optimal over all unbiased linear combinations of the two estimators, because the function we're optimizing is convex. I don't know that much about this, so I won't go into great detail. My understanding is that the convexity basically means optimizing the function will definitely give us a global minimum -- there aren't several local minima we could get stuck in.

The method of optimizing with respect to the trace of the covariance is foundational to the derivation of the Kalman Filter, so understanding this prerequisite is critical.