The Policy Gradient
In my course, we reviewed the Linear Quadratic Regulator and its downstream variants like MPC. While this deserves its own post, I'm just not going to get to it. Instead, I'll talk about a newer solution to the optimal control problem that leverages machine learning.
Stochastic Controllers
Up until this point, I've only dealt with deterministic feedback controllers. Given a certain state, we know what our control input will be. Even though this control input's effect on the next state will often be stochastic due to the noise term in the dynamics, the controller itself was deterministic. Here, we do not make this assumption. Instead, our control input will be sampled from a distribution learned by a parameterized model. Why do we use a stochastic controller? As you'll eventually see, it helps us compute the policy gradient. And what's the policy gradient? Let me give some background.
Say we already have a stochastic controller parameterized by $\theta$, known as the policy $\pi_\theta$. We pass a state $x_t$ into our model and get a distribution over control inputs. Our model often just predicts the mean of a Gaussian, and we use a predetermined variance. Therefore:

$$u_t \sim \pi_\theta(\cdot \mid x_t) = \mathcal{N}\big(\mu_\theta(x_t),\; \sigma^2 I\big)$$
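As a concrete sketch (the class and variable names are mine, nothing beyond plain PyTorch is assumed), a Gaussian policy with a learned mean and a fixed standard deviation might look like this:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Stochastic controller: maps a state to a Gaussian over control inputs."""
    def __init__(self, state_dim, control_dim, sigma=0.1):
        super().__init__()
        # Small MLP that predicts the mean of the control distribution.
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, control_dim)
        )
        self.sigma = sigma  # predetermined (fixed) standard deviation

    def forward(self, x):
        mean = self.mean_net(x)
        return torch.distributions.Normal(mean, self.sigma)

# Sampling a control input for a state:
policy = GaussianPolicy(state_dim=4, control_dim=1)
dist = policy(torch.zeros(4))
u = dist.sample()            # stochastic control input
log_prob = dist.log_prob(u)  # log pi_theta(u | x), used later for the gradient
```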
Our goal is to find the parameters of the model that maximize the expected reward (we use reward instead of cost). Say we collect a particular trajectory $\tau = (x_0, u_0, x_1, u_1, \ldots, x_T, u_T)$ by running our stochastic controller repeatedly. The reward for that trajectory is the discounted return:

$$R(\tau) = \sum_{t=0}^{T} \gamma^t \, r(x_t, u_t)$$
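As a quick sketch (the helper name is mine), the discounted return is easy to compute from a list of per-step rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum over t of gamma^t * r_t for one trajectory's rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: discounted_return([1.0, 1.0, 1.0], gamma=0.9) == 1 + 0.9 + 0.81 == 2.71
```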
The goal can then be stated as:

$$\theta^\ast = \arg\max_\theta \; J(\theta), \qquad J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big]$$
Since our controller is stochastic, we are saying that we want to find the parameters of our model that maximize the average return when we roll out trajectories with that controller from a start state.
One way to improve our controller is by randomly perturbing the parameters, re-evaluating the expected return, and choosing parameters that perform better. Another way is by computing the policy gradient.
The search space is way too large to directly find the parameters that maximize this function, and there is no closed-form solution for such a broad problem as there is in a more constrained setting like LQR. Instead, we can use gradient ascent. This requires us to compute the policy gradient: the gradient of the expected return with respect to the parameters. The policy gradient tells us which direction to shift the parameters to increase the expected return.
More formally, the gradient ascent algorithm repeats the following update until convergence, where $\alpha$ is the step size:

$$\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)$$
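A minimal sketch of that update in PyTorch (the function name is mine), assuming the gradient has already been computed and stored in each parameter's `.grad` field:

```python
import torch

def gradient_ascent_step(policy, alpha=1e-3):
    """theta <- theta + alpha * grad_theta J(theta).

    Assumes each parameter's .grad already holds its component of the
    policy-gradient estimate (how to compute that is derived below)."""
    with torch.no_grad():
        for p in policy.parameters():
            if p.grad is not None:
                p.add_(alpha * p.grad)
```

In practice you would hand the parameters to a built-in optimizer such as torch.optim.Adam and minimize the negative objective instead, since the built-in optimizers perform descent.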
The tricky part is computing the policy gradient. Let's first take a look at the expanded gradient:

$$\nabla_\theta J(\theta) = \nabla_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big] = \nabla_\theta \int p_\theta(\tau) \, R(\tau) \, d\tau$$
Since the return $R(\tau)$ of a given trajectory does not explicitly depend on the parameters (only the trajectory distribution $p_\theta$ does), we can move the gradient inside the integral and then apply the log-derivative trick, $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau)$:

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau) \, R(\tau) \, d\tau = \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, R(\tau) \, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau) \, \nabla_\theta \log p_\theta(\tau)\big]$$
Now our policy gradient can be computed as the expectation, over all trajectories, of the return times the gradient of the log probability of that trajectory. First, we can approximate this expectation by sampling a finite number of trajectories from our controller in Monte Carlo fashion:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} R(\tau_i) \, \nabla_\theta \log p_\theta(\tau_i)$$
This is much more tractable, but we still have an unaddressed gradient, $\nabla_\theta \log p_\theta(\tau)$. Let's break that down. We know that the probability of a trajectory is given as:

$$p_\theta(\tau) = p(x_0) \prod_{t=0}^{T} \pi_\theta(u_t \mid x_t) \, p(x_{t+1} \mid x_t, u_t)$$
and the log probability:

$$\log p_\theta(\tau) = \log p(x_0) + \sum_{t=0}^{T} \Big[ \log \pi_\theta(u_t \mid x_t) + \log p(x_{t+1} \mid x_t, u_t) \Big]$$
Now, taking the gradient lets us drop the initial-state and transition probabilities because they do not depend on $\theta$:

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(u_t \mid x_t)$$
Calculating $\nabla_\theta \log \pi_\theta(u_t \mid x_t)$ is simple because we know how to find the probability of a sample from our distribution (think of evaluating the density of a Gaussian at a sample). Therefore, we've found a tractable approximation of the policy gradient! A library like PyTorch will find the partial derivatives of this log probability with respect to every parameter in our model using automatic differentiation.
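Here is a minimal sketch of one policy-gradient update, assuming the GaussianPolicy above and a hypothetical env object whose reset() returns an initial state and whose step(u) returns (next_state, reward, done), roughly in the style of a Gym environment; autograd handles the $\nabla_\theta \log \pi_\theta$ terms:

```python
import torch

def policy_gradient_step(policy, env, optimizer, gamma=0.99, horizon=200):
    """One REINFORCE-style update from a single sampled trajectory (N = 1)."""
    x = torch.as_tensor(env.reset(), dtype=torch.float32)
    log_probs, rewards = [], []
    for _ in range(horizon):
        dist = policy(x)                       # pi_theta(. | x_t)
        u = dist.sample()                      # sample a control input
        log_probs.append(dist.log_prob(u).sum())
        x_next, r, done = env.step(u.numpy())  # hypothetical env interface
        rewards.append(r)
        x = torch.as_tensor(x_next, dtype=torch.float32)
        if done:
            break

    # Discounted return R(tau) of the sampled trajectory.
    R = sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Minimizing -R(tau) * sum_t log pi_theta(u_t | x_t) is gradient *ascent*
    # on the Monte Carlo policy-gradient estimate derived above.
    loss = -R * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()   # autograd computes the grad-log-prob terms
    optimizer.step()
    return R

# Usage (assuming the GaussianPolicy sketch and a compatible env):
# optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
# for _ in range(1000):
#     policy_gradient_step(policy, env, optimizer)
```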