Dynamic Programming and Optimal Control Theory
Markov Decision Processes and the Optimal Control Problem
I've talked about Markov Decision Processes (MDPs) quite a bit in the context of filtering and state estimation, but I've always left the choice of control input vague. In this post, I'll walk through strategies for efficiently selecting and computing optimal control inputs within an MDP framework.
In essence, a Markov Decision Process describes an agent (characterized by a state $x_t$) within an environment. This agent takes actions within the environment to evolve its state over a sequence of discrete, but not necessarily finite, timesteps ($t = 0, 1, 2, \ldots$).
At each timestep, the agent observes its current state $x_t$ (likely using some kind of filtering to provide a better estimate), selects an action $u_t$, transitions to a new state $x_{t+1}$ (either deterministically or stochastically) according to the state-action pair, and receives a corresponding reward or cost.
Quick Note:
The line "transitions to a new state according to the state-action pair" implies a key assumption made by MDPs: the Markov Assumption. It states that the future state is solely dependent on the current state (and a current control input), not previous states. This assumption is key for both simplicity and tractability, but it can be a stretch if we don't design our model well. It's critical that the state captures all relevant information from the past that influences future evolution. Designing good state representations is therefore critical to the performance of an MDP.
$\mathcal{X}$ refers to the state-space: the set of available states the model can be in; $\mathcal{U}$, the action-space: the set of possible actions the agent can take. Oftentimes, the action-space is constrained by our state. In a grid world with immovable walls, the action space shouldn't include moves that would put the agent inside a wall. Because of this, people sometimes write $\mathcal{U}(x_t)$ even for deterministic systems. Note that this is different than a control feedback $\mu_t(x_t)$; I'll get into that later if you're not familiar.
As for the transition model, I noted that this process can be deterministic or stochastic. In a system with deterministic dynamics, the transitions can be written as follows:

$$x_{t+1} = f(x_t, u_t)$$

where $f$ is the process model. In a stochastic system,

$$x_{t+1} = f(x_t, u_t, \epsilon_t)$$

where $\epsilon_t$ is noise sampled from some probability distribution, known or unknown. A more informative representation models the dynamics as a probability distribution:

$$x_{t+1} \sim P(x_{t+1} \mid x_t, u_t)$$
Finally, we have the reward/cost. These values are used to determine the optimal control inputs and are what differentiate an MDP from a regular Markov process: they give us something to optimize in the first place. Some resources use reward maximization as the objective and others use cost minimization. In this post, I'll use cost minimization.
A cost function basically spits out a number for how "good" (or bad) our current choice of something is. I leave that vague because the formulation of the cost function is a modeling choice: it could depend only on the current state $x_t$, the action $u_t$, the resulting next state $x_{t+1}$, or some combination of those. Since most real-world systems are stochastic and a given state-action pair can deposit us in any number of next states, I think it's fitting to primarily formulate the cost function over a specific state transition. To align my notation with the class I'm taking on this (which uses a slightly different notation that I feel is slightly ambiguous), I'll write this transition-specific cost function as $c(x_t, u_t, x_{t+1})$.
Then, to get a more general value of the cost of a state-action pair, we can marginalize over the possible next states:

$$c(x_t, u_t) = \mathbb{E}_{x_{t+1}}\big[c(x_t, u_t, x_{t+1})\big] = \sum_{x_{t+1} \in \mathcal{X}} P(x_{t+1} \mid x_t, u_t)\, c(x_t, u_t, x_{t+1})$$
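To make the bookkeeping concrete, here's a minimal sketch (Python/NumPy) of that marginalization on a made-up 3-state, 2-action MDP. The arrays `P` and `C` are hypothetical stand-ins for whatever transition and cost model you actually have.

```python
import numpy as np

# Hypothetical MDP with |X| = 3 states and |U| = 2 actions.
n_states, n_actions = 3, 2

# P[x, u, x'] = probability of landing in x' after taking u in x.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
    [[0.0, 0.9, 0.1], [0.5, 0.5, 0.0]],
    [[0.3, 0.0, 0.7], [0.0, 0.2, 0.8]],
])

# C[x, u, x'] = cost of the specific transition (x, u) -> x'.
C = np.ones((n_states, n_actions, n_states))
C[:, :, 2] = 0.0            # say landing in state 2 costs nothing

# Marginalize over next states: c(x, u) = sum_x' P(x' | x, u) * c(x, u, x').
c = (P * C).sum(axis=2)     # shape (|X|, |U|)
print(c)
```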
Loss over a trajectory (the deterministic case)
If we already have a sequence of control inputs for a finite trajectory in mind, a function for the overall cost is quite simple: it's the sum of run-time costs plus a terminal cost. The terminal cost represents how well we did against some desired end state. If our optimal control problem was supposed to provide us an easy route to school and we didn't end up at school, the cost should be high! I represent this terminal cost with $c_T(x_T)$.
Therefore, the cost of a particular trajectory starting at $x_0$ is:

$$J(x_0; u_0, \ldots, u_{T-1}) = c_T(x_T) + \sum_{t=0}^{T-1} c(x_t, u_t), \qquad x_{t+1} = f(x_t, u_t)$$
The optimal cost will be represented as $J^*(x_0)$:

$$J^*(x_0) = \min_{u_0, \ldots, u_{T-1}} \left[ c_T(x_T) + \sum_{t=0}^{T-1} c(x_t, u_t) \right]$$
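Here's what the brute-force version of this minimization looks like in code: a minimal sketch on a toy deterministic line world, where the dynamics `f`, costs `c` and `c_T`, and horizon are all made up for illustration. It literally enumerates every control sequence.

```python
import itertools

# Toy deterministic problem: integer state on a line, actions move -1/0/+1.
actions = [-1, 0, 1]
T = 4                                   # horizon
goal = 3

def f(x, u):                            # deterministic process model
    return x + u

def c(x, u):                            # run-time cost: penalize effort
    return abs(u)

def c_T(x):                             # terminal cost: distance to goal
    return 10 * abs(x - goal)

def trajectory_cost(x0, controls):
    x, total = x0, 0.0
    for u in controls:
        total += c(x, u)
        x = f(x, u)
    return total + c_T(x)

# Brute force: try every control sequence, all |U|^T of them.
best = min(itertools.product(actions, repeat=T),
           key=lambda seq: trajectory_cost(0, seq))
print(best, trajectory_cost(0, best))
```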
There are two things to note about this method:
- It's inefficient. We have to consider $|\mathcal{U}|$ options at each timestep, meaning the time complexity is $O(|\mathcal{U}|^T)$. I'll provide a better algorithm soon when I go into dynamic programming.
- This one is important: ahead-of-time planning really only makes sense in deterministic settings. In the deterministic setting, the starting state and the control input sequence fully determine where we'll end up. In the stochastic setting, the realization of noise affects where a given state-action pair deposits us. For this reason, it makes much more sense to adopt a control feedback approach, where the control input is actually responsive to the state we're in. If we did want to do open-loop control with this strategy, the time complexity would be astronomical: at each timestep in the summation we could be in any of $|\mathcal{X}|$ states for a fixed control input, so evaluating the expected cost of a single control sequence means summing over all $|\mathcal{X}|^T$ possible state trajectories, and we'd still be minimizing over $|\mathcal{U}|^T$ control sequences, for $O(|\mathcal{X}|^T |\mathcal{U}|^T)$ overall. Shockingly intractable. Regardless of the time complexity, we almost never want to do this because the noise introduces unpredictability.
Loss over a trajectory (the stochastic case)
Recall how I said earlier that the open-loop approach doesn't make sense in the stochastic setting. What we need is control feedback: a function $\mu_t$ which maps a state to a control input, $u_t = \mu_t(x_t)$. Our policy $\pi$ is the sequence of feedback controls:

$$\pi = (\mu_0, \mu_1, \ldots, \mu_{T-1})$$
Revisiting the cost function from before, some adjustments need to be made for the stochastic case. Instead of directly choosing the control inputs that minimize the cost, we need to pick the policy that minimizes the expected cost over all possible state trajectories. That's a bit hard to decipher, so here's the original cost function in terms of the expectation:

$$J_\pi(x_0) = \mathbb{E}\left[ c_T(x_T) + \sum_{t=0}^{T-1} c\big(x_t, \mu_t(x_t), x_{t+1}\big) \right]$$
Now, to expand the expectation, we sum over all of the probability-weighted trajectory costs:

$$J_\pi(x_0) = \sum_{x_1, \ldots, x_T} \left( \prod_{t=0}^{T-1} P\big(x_{t+1} \mid x_t, \mu_t(x_t)\big) \right) \left( c_T(x_T) + \sum_{t=0}^{T-1} c\big(x_t, \mu_t(x_t), x_{t+1}\big) \right)$$
We can also write this in terms of the marginalized cost $c(x_t, u_t)$ instead, where the expectation over next states is baked into the cost terms inside the sum:

$$J_\pi(x_0) = \mathbb{E}\left[ c_T(x_T) + \sum_{t=0}^{T-1} c\big(x_t, \mu_t(x_t)\big) \right]$$
The product $\prod_{t=0}^{T-1} P(x_{t+1} \mid x_t, \mu_t(x_t))$ represents the probability of the entire trajectory. It's a lot simpler to compute this probability thanks to the Markov Property; otherwise, if the next state depended on all previous states, each factor would be $P(x_{t+1} \mid x_0, \ldots, x_t, \mu_t(x_t))$. The other term getting multiplied by this probability is just the cost of that trajectory. Both of these terms require $O(T)$ computations, so each trajectory costs $O(T)$ to evaluate. The outer sum adds up these weighted costs over all possible state trajectories. There are $|\mathcal{X}|$ options per time-step and $T$ time steps, so $|\mathcal{X}|^T$ possible trajectories. Therefore, the full time complexity to compute the expected cost of a policy is $O(T\,|\mathcal{X}|^T)$.
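As a sanity check on that bookkeeping, here's a small sketch that computes the expected cost of a fixed policy by enumerating every state trajectory and weighting it by its Markov-factored probability. The random MDP and policy here are purely illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 3

# Hypothetical random MDP: P[x, u, x'] transition probs, C[x, u, x'] costs.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
C = rng.random((n_states, n_actions, n_states))
c_T = rng.random(n_states)                      # terminal cost

# A fixed (arbitrary) policy: pi[t][x] = control to use in state x at time t.
pi = rng.integers(n_actions, size=(T, n_states))

def expected_cost(x0):
    """Sum over all |X|^T state trajectories, weighting each by its
    Markov-factored probability prod_t P(x_{t+1} | x_t, u_t)."""
    total = 0.0
    for traj in itertools.product(range(n_states), repeat=T):  # x_1 ... x_T
        states = (x0,) + traj
        prob, cost = 1.0, c_T[states[-1]]
        for t in range(T):
            u = pi[t][states[t]]
            prob *= P[states[t], u, states[t + 1]]
            cost += C[states[t], u, states[t + 1]]
        total += prob * cost
    return total

print(expected_cost(x0=0))
```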
How about the minimization over all policies? Hint, it's way worse.
Each control feedback function $\mu_t$ has $|\mathcal{X}|$ entries, where each entry can take $|\mathcal{U}|$ values. This means, for each control feedback, there are $|\mathcal{U}|^{|\mathcal{X}|}$ options. We're searching over sequences of $T$ of these feedbacks, meaning the search has to try $|\mathcal{U}|^{T|\mathcal{X}|}$ different policies. As I said, evaluating the cost of each of these policies is $O(T\,|\mathcal{X}|^T)$. Therefore, the final time complexity is $O(T\,|\mathcal{X}|^T\,|\mathcal{U}|^{T|\mathcal{X}|})$. If this doesn't motivate a better algorithm, I don't know what does.
Dynamic Programming
Enter Dynamic Programming (DP). DP is a massive part of Optimal Control Theory and it relies on one key fact: the tail of an optimal path is also optimal.
Proof that the tail of an optimal path is optimal
This fact can easily be proven by contradiction.
Suppose we have an optimal path $(x_0^*, x_1^*, \ldots, x_T^*)$. Assume the tail of this path from some intermediate time $t$ onward is not optimal, so there exists some other tail starting at $x_t^*$ with a lower cost. Then we could splice that cheaper tail onto the head of the original path to construct a full path with a lower total cost, which contradicts our assumption that we had an optimal path. Boom.
Bellman Equation
This principle leads us to the Bellman equation, which forms the foundation of dynamic programming in optimal control. I'll first provide the deterministic equation since it's easier to look at. First, we initialize $J_T(x_T) = c_T(x_T)$ for all $x_T \in \mathcal{X}$. Then, working backwards in time, we compute the following for all $x_t \in \mathcal{X}$:

$$J_t(x_t) = \min_{u_t \in \mathcal{U}(x_t)} \Big[ c(x_t, u_t) + J_{t+1}\big(f(x_t, u_t)\big) \Big]$$
Once this algorithm terminates at $t = 0$, we have the optimal cost-to-go $J_0(x_0)$ for each starting state. Now, for each timestep and for each state, we have to search over all possible control inputs, yielding a time complexity of $O(T\,|\mathcal{X}|\,|\mathcal{U}|)$! Since we don't need to minimize over an entire control sequence at once, we avoid the huge $|\mathcal{U}|^T$ term.
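Here's a minimal sketch of this backward recursion, reusing the same toy line world as the brute-force example above (with the state clamped to a finite range so the cost-to-go table stays finite; all the specifics are illustrative):

```python
# Finite-horizon backward recursion (deterministic Bellman equation).
T = 4
goal = 3
states = list(range(-5, 6))
actions = [-1, 0, 1]

def f(x, u):                       # deterministic process model (clamped)
    return max(min(x + u, 5), -5)

def c(x, u):                       # run-time cost
    return abs(u)

def c_T(x):                        # terminal cost
    return 10 * abs(x - goal)

# Initialize J_T(x) = c_T(x), then sweep backwards in time.
J = {T: {x: c_T(x) for x in states}}
mu = {}                            # mu[t][x] = best control at (t, x)
for t in reversed(range(T)):
    J[t], mu[t] = {}, {}
    for x in states:
        best_u = min(actions, key=lambda u: c(x, u) + J[t + 1][f(x, u)])
        mu[t][x] = best_u
        J[t][x] = c(x, best_u) + J[t + 1][f(x, best_u)]

# Roll the feedback policy forward from x = 0 to recover the optimal plan.
x, plan = 0, []
for t in range(T):
    plan.append(mu[t][x])
    x = f(x, plan[-1])
print(J[0][0], plan)
```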
Now that we understand the deterministic Bellman equation, it's time to attack the stochastic case. Here, I think it's worthwhile to show off both the more explicit cost function $c(x_t, u_t, x_{t+1})$ and the marginalized $c(x_t, u_t)$. Like the deterministic case, we can define the cost-to-go at time $t$ and state $x_t$ using a recursive approach to speed things up. Once again, initialize with the terminal cost, $J_T(x_T) = c_T(x_T)$ for all $x_T$. I'll give a few equivalent forms below:

$$
\begin{aligned}
J_t(x_t) &= \min_{u_t \in \mathcal{U}(x_t)} \mathbb{E}_{x_{t+1}}\big[ c(x_t, u_t, x_{t+1}) + J_{t+1}(x_{t+1}) \big] \\
&= \min_{u_t \in \mathcal{U}(x_t)} \sum_{x_{t+1} \in \mathcal{X}} P(x_{t+1} \mid x_t, u_t)\,\big[ c(x_t, u_t, x_{t+1}) + J_{t+1}(x_{t+1}) \big] \\
&= \min_{u_t \in \mathcal{U}(x_t)} \Big[ c(x_t, u_t) + \mathbb{E}_{x_{t+1}}\big[ J_{t+1}(x_{t+1}) \big] \Big]
\end{aligned}
$$
We just run this algorithm in reverse, for all states at each timestep, until we reach $t = 0$. I'll first analyze the time complexity of a single minimization. In the worst case, the expectation requires an iteration over all $|\mathcal{X}|$ states. The computation on the inside of the expectation is constant (two lookups). Then, we search over the control inputs available at our current state $x_t$. For a particular state, this leaves us with $|\mathcal{U}|$ options to minimize over (or potentially fewer, $|\mathcal{U}(x_t)|$, depending on the problem). Therefore, the cost per minimization is $O(|\mathcal{X}|\,|\mathcal{U}|)$. We need to compute this for each state and over all time steps, resulting in a final time complexity of $O(T\,|\mathcal{X}|^2\,|\mathcal{U}|)$.
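And here's a sketch of the stochastic version, again on a hypothetical random MDP, using the $\sum_{x'} P(x' \mid x, u)\,[c(x,u,x') + J_{t+1}(x')]$ form:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, T = 4, 3, 5

# Hypothetical random MDP (stand-ins for a real model).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)           # rows are proper distributions
C = rng.random((n_states, n_actions, n_states))
c_T = rng.random(n_states)

# Backward recursion: J[t, x] = min_u E[ c(x, u, x') + J[t+1, x'] ].
J = np.zeros((T + 1, n_states))
mu = np.zeros((T, n_states), dtype=int)     # optimal feedback controls
J[T] = c_T
for t in reversed(range(T)):
    # Q[x, u] = sum_x' P(x'|x,u) * (C[x,u,x'] + J[t+1, x'])
    Q = np.einsum("xun,xun->xu", P, C + J[t + 1][None, None, :])
    mu[t] = Q.argmin(axis=1)
    J[t] = Q.min(axis=1)

print(J[0])                                  # optimal cost-to-go at t = 0
```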
From Finite-Horizon to Infinite-Horizon Optimal Control
Everything I've shown so far has assumed some finite time horizon $T$. In some settings, a finite horizon is very natural. Consider the example from my class notes:
If the horizon T changes, say you are in a hurry to get to school, the optimal trajectory may take control inputs that incur a lot of runtime cost simply to reach closer to the goal state (something that keeps the terminal cost small).
There are many cases, however, where picking a time horizon is not so easy. What if we're trying to craft an optimal control policy for a robot to pick up a bottle? We probably want it to go relatively quickly, but how fast is not immediately clear. In these situations, we want to formulate the optimal control problem as something that "allows a trajectory infinite steps but also encourages the length of the trajectory to be small enough in order to be meaningful". These are infinite-horizon optimal control problems.
Stationary Dynamics and Costs
Since the length of the trajectory is now unbounded, it makes more sense to think about the system dynamics and run-time cost as being time invariant (or stationary):

$$P(x_{t+1} = x' \mid x_t = x,\, u_t = u) = P(x' \mid x, u), \qquad c_t(x, u) = c(x, u) \quad \text{for all } t$$
Additionally, since there is no definitive end-time, we don't actually have a terminal cost. The distance to the goal must now be encoded in the run-time cost.
Instead of minimizing over a finite sequence of feedback controls, we now minimize over an infinite one:

$$\pi = (\mu_0, \mu_1, \mu_2, \ldots)$$
To find the cost of a particular control policy $\pi$ for a starting state $x_0$, we actually have to take the limit of the expectation as $T \to \infty$:

$$J_\pi(x_0) = \lim_{T \to \infty} \mathbb{E}\left[ \sum_{t=0}^{T-1} \gamma^t\, c\big(x_t, \mu_t(x_t)\big) \right]$$
Now the optimal cost-to-go is:

$$J^*(x_0) = \min_\pi J_\pi(x_0)$$
I'm sure you noticed the $\gamma^t$ term in the sum. This discount factor is chosen so the cost-to-go does not diverge to infinity. It puts higher emphasis on earlier costs than later ones, thereby encouraging shorter trajectories. The gamma terms actually form a geometric series with a known bound when $0 \le \gamma < 1$: $\sum_{t=0}^{\infty} \gamma^t = \frac{1}{1-\gamma}$. So long as the run-time cost is bounded in some way (let's say by a constant $c_{\max}$), our cost-to-go will be bounded by $\frac{c_{\max}}{1-\gamma}$. For example, with $c_{\max} = 1$ and $\gamma = 0.95$, the cost-to-go can never exceed $20$. Values of $\gamma$ closer to 1 result in an effectively longer time horizon and vice-versa.
Value Iteration
In finite horizon problems, we have a clear endpoint (terminal state) that gives us a natural direction to work: backwards. We know the value at the final time step and can recursively work backward to determine optimal actions at each preceding state.
For infinite horizon problems, we don't have this terminal state to anchor our computation. Instead, we're looking for a stationary policy - one that doesn't depend on time, just on the state. The key intuition shifts from "working backward from the end" to "finding a fixed point of the Bellman equation." It's a much more global problem. For each state in our state space, we're iteratively improving a guess of the cost-to-go until we reach a certain steady state (which will be close to the optimal cost-to-go). From there, we can easily trace back to get the approximately optimal policy.
One iterative algorithm for finding the optimal cost-to-go at each point in space is called value iteration. Under this algorithm, we iteratively develop a sequence of cost-to-go estimates for each state in the state space until we roughly converge. It goes like this:
- Initialize $J^{0}(x) = 0$ for all $x \in \mathcal{X}$.
- Update using the Bellman equation at each iteration for all states in the state space: $J^{k+1}(x) = \min_{u \in \mathcal{U}(x)} \big[ c(x, u) + \gamma \sum_{x' \in \mathcal{X}} P(x' \mid x, u)\, J^{k}(x') \big]$. Do this until the value converges at all states according to some threshold: $\max_{x} |J^{k+1}(x) - J^{k}(x)| < \epsilon$.
- Let the number of iterations required for convergence be $K$. We can reconstruct the stationary policy: $\mu(x) = \arg\min_{u \in \mathcal{U}(x)} \big[ c(x, u) + \gamma \sum_{x' \in \mathcal{X}} P(x' \mid x, u)\, J^{K}(x') \big]$
For now, disregard the fact that I've phrased the run-time cost in terms of $c(x, u)$ and not $c(x, u, x')$; assume the problem is set up such that it doesn't matter which next state the state-action pair deposits us in with respect to the run-time cost. A minimal code sketch of the procedure is below.
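This sketch of value iteration uses random transition probabilities and marginalized costs as hypothetical stand-ins for a real model; `gamma` and `eps` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 5, 3
gamma, eps = 0.95, 1e-6

# Hypothetical stationary MDP: transition probs and marginalized costs c(x, u).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((n_states, n_actions))

J = np.zeros(n_states)
while True:
    # Bellman update: Q[x, u] = c(x, u) + gamma * sum_x' P(x'|x,u) J(x')
    Q = c + gamma * P @ J                 # (x, u, x') @ (x',) -> (x, u)
    J_new = Q.min(axis=1)
    if np.max(np.abs(J_new - J)) < eps:   # converged at all states
        J = J_new
        break
    J = J_new

mu = (c + gamma * P @ J).argmin(axis=1)   # recover the stationary policy
print(J, mu)
```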
Policy Iteration
So far, value iteration was all about iteratively improving our guess of the cost-to-go and then recovering the optimal policy afterward. But there's another approach that flips this: Policy Iteration. Instead of starting with a cost guess, we start with a policy and improve it directly. If you've got a decent initial guess, this can converge fast.
Policy Iteration alternates between two key steps:
- Policy Evaluation: Given a policy $\mu$, compute its cost-to-go function $J^{\mu}$.
- Policy Improvement: Use $J^{\mu}$ to construct a better policy $\mu'$.
Repeat this until the policy stops changing; then you're at (or very close to) the optimal policy.
Step 1: Policy Evaluation
If we're handed a stationary policy $\mu$ (i.e. $\pi = (\mu, \mu, \mu, \ldots)$), then the Bellman equation becomes a linear system:

$$J^{\mu}(x) = c\big(x, \mu(x)\big) + \gamma \sum_{x' \in \mathcal{X}} P\big(x' \mid x, \mu(x)\big)\, J^{\mu}(x')$$
You do this for all $x \in \mathcal{X}$. It's just a system of $|\mathcal{X}|$ equations in $|\mathcal{X}|$ unknowns; you can solve it with linear algebra, or with iterative methods like Gauss-Seidel if $|\mathcal{X}|$ is big.
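A quick sketch of the direct linear-algebra route, assuming the same kind of tabular model as before (`P`, `c`, and the fixed policy `mu` here are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions = 5, 3
gamma = 0.95

# Hypothetical stationary MDP.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((n_states, n_actions))

mu = rng.integers(n_actions, size=n_states)      # some fixed policy

# Restrict dynamics and cost to the policy's chosen actions.
idx = np.arange(n_states)
P_mu = P[idx, mu]            # (|X|, |X|): P(x' | x, mu(x))
c_mu = c[idx, mu]            # (|X|,):     c(x, mu(x))

# Bellman equation for a fixed policy:  J = c_mu + gamma * P_mu @ J
# Rearranged: (I - gamma * P_mu) J = c_mu, a |X|-by-|X| linear system.
J_mu = np.linalg.solve(np.eye(n_states) - gamma * P_mu, c_mu)
print(J_mu)
```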
Step 2: Policy Improvement
Once you've got $J^{\mu}$, ask: "Could I do better from this state?"
So for each state, compute:

$$\mu'(x) = \arg\min_{u \in \mathcal{U}(x)} \Big[ c(x, u) + \gamma \sum_{x' \in \mathcal{X}} P(x' \mid x, u)\, J^{\mu}(x') \Big]$$
If $\mu'(x) = \mu(x)$ for all $x$, then you've converged. Otherwise, repeat the steps with this new policy.
Algorithm Summary
- Initialize an arbitrary policy $\mu^{0}$
- Loop until convergence:
- Evaluate $J^{\mu^{k}}(x)$ for all $x \in \mathcal{X}$
- Improve the policy: $\mu^{k+1}(x) = \arg\min_{u \in \mathcal{U}(x)} \big[ c(x, u) + \gamma \sum_{x'} P(x' \mid x, u)\, J^{\mu^{k}}(x') \big]$
This method usually converges faster than value iteration when you start near a good policy, since the evaluation step can be done exactly (or near-exactly) in one sweep. A minimal end-to-end sketch is below.
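Here's that loop in code, again on a hypothetical tabular MDP with made-up transition probabilities and costs:

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions = 5, 3
gamma = 0.95

# Hypothetical stationary MDP.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((n_states, n_actions))

mu = np.zeros(n_states, dtype=int)               # arbitrary initial policy
idx = np.arange(n_states)
while True:
    # Policy evaluation: solve (I - gamma * P_mu) J = c_mu exactly.
    P_mu, c_mu = P[idx, mu], c[idx, mu]
    J = np.linalg.solve(np.eye(n_states) - gamma * P_mu, c_mu)

    # Policy improvement: greedy one-step lookahead against J.
    mu_new = (c + gamma * P @ J).argmin(axis=1)
    if np.array_equal(mu_new, mu):               # policy stopped changing
        break
    mu = mu_new

print(mu, J)
```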
Wrap-Up
There's a lot more to say and explain about these algorithms, but in their tabular form they're less useful for real-world robotics, where we're usually dealing with significantly larger state and action spaces. I might revisit this at some point or provide some demos, but not now.