New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Efficient computation of optimal actions

Edited by James L. McClelland, Stanford University, Stanford, CA, and approved April 28, 2009 (received for review November 16, 2007)
Abstract
Optimal choice of actions is a fundamental problem relevant to fields as diverse as neuroscience, psychology, economics, computer science, and control engineering. Despite this broad relevance the abstract setting is similar: we have an agent choosing actions over time, an uncertain dynamical system whose state is affected by those actions, and a performance criterion that the agent seeks to optimize. Solving problems of this kind remains hard, in part, because of overly generic formulations. Here, we propose a more structured formulation that greatly simplifies the construction of optimal control laws in both discrete and continuous domains. An exhaustive search over actions is avoided and the problem becomes linear. This yields algorithms that outperform Dynamic Programming and Reinforcement Learning, and thereby solve traditional problems more efficiently. Our framework also enables computations that were not possible before: composing optimal control laws by mixing primitives, applying deterministic methods to stochastic systems, quantifying the benefits of error tolerance, and inferring goals from behavioral data via convex optimization. Development of a general class of easily solvable problems tends to accelerate progress—as linear systems theory has done, for example. Our framework may have similar impact in fields where optimal choice of actions is relevant.
If you are going to act, you might as well act in the best way possible. But which way is best? This is the general problem we consider here. Examples include a nervous system generating muscle activations to maximize movement performance (1), a foraging animal deciding which way to turn to maximize food (2), an internet router directing packets to minimize delays (3), an onboard computer controlling a jet engine to minimize fuel consumption (4), and an investor choosing transactions to maximize wealth (5). Such problems are often formalized as Markov decision processes (MDPs), with stochastic dynamics p(x′x,u) specifying the transition probability from state x to state x′ under action u, and immediate cost ℓ(x,u) for being in state x and choosing action u. The performance criterion that the agent seeks to optimize is some cumulative cost that can be formulated in multiple ways. Throughout the article we focus on one formulation (total cost with terminal/goal states) and summarize results for other formulations.
Optimal actions cannot be found by greedy optimization of the immediate cost, but instead must take into account all future costs. This is a daunting task because the number of possible futures grows exponentially with time. What makes the task doable is the optimal costtogo function v(x) defined as the expected cumulative cost for starting at state x and acting optimally thereafter. It compresses all relevant information about the future and thus enables greedy computation of optimal actions. v(x) equals the minimum (over actions u) of the immediate cost ℓ(x,u) plus the expected costtogo E[v(x′)] at the next state x′: The subscript indicates that the expectation is taken with respect to the transition probability distribution p(·x,u) induced by action u. Eq. 1 is fundamental to optimal control theory and is called the Bellman equation. It gives rise to Dynamic Programming (3) and Reinforcement Learning (2) methods that are very general but can be inefficient. Indeed, Eq. 1 characterizes v(x) only implicitly, as the solution to an unsolved optimization problem, impeding both analytical and numerical approaches.
Here, we show how the Bellman equation can be greatly simplified. We find an analytical solution for the optimal u given v, and then transform Eq. 1 into a linear equation. Short of solving the entire problem analytically, reducing optimal control to a linear equation is the best one can hope for. This simplification comes at a modest price: although we impose certain structure on the problem formulation, most control problems of practical interest can still be handled. In discrete domains our work has no precursors. In continuous domains there exists related prior work (6–8) that we build on here. Additional results can be found in our recent conference articles (9–11), online preprints (12–14), and supplementary notes [supporting information (SI) Appendix].
Results
Reducing Optimal Control to a Linear Problem.
We aim to construct a general class of MDPs where the exhaustive search over actions is replaced with an analytical solution. Discrete optimization problems rarely have analytical solutions, thus our agenda calls for continuous actions. This may seem counterintuitive if one thinks of actions as symbols (“go left,” “go right”). However, what gives meaning to such symbols are the underlying transition probabilities—which are continuous. The latter observation is key to the framework developed here. Instead of asking the agent to specify symbolic actions, which are then replaced with transition probabilities, we allow the agent to specify transition probabilities u(x′x) directly. Formally, we have p(x′x,u) = u(x′x).
Thus, our agent has the power to reshape the dynamics in any way it wishes. However, it pays a price for too much reshaping, as follows. Let p(x′x) denote the passive dynamics characterizing the behavior of the system in the absence of controls. The latter will usually be defined as a random walk in discrete domains and as a diffusion process in continuous domains. Note that the notion of passive dynamics is common in continuous domains but is rarely used in discrete domains. We can now quantify how “large” an action is by measuring the difference between u(·x) and p(·x). Differences between probability distributions are usually measured via Kullback–Leibler (KL) divergence, suggesting an immediate cost of the form
The state cost q(x) can be an arbitrary function encoding how (un)desirable different states are. The passive dynamics p(x′x) and controlled dynamics u(x′x) can also be arbitrary, except that we require u(x′x) = 0 whenever p(x′x) = 0. This constraint is needed to make KL divergence welldefined. It has the added benefit of preventing the agent from jumping directly to goal states, and more generally from making state transitions that are physically impossible.
Fig. 1 illustrates the construction above with two simple examples. Fig. 1A is a cointoss problem where q(Tails) = 0, q(Heads) = 1 and the passive dynamics correspond to an unbiased coin. The action u has the effect of biasing the coin. The optimal bias, which turns out to be u*(Heads) = 0.27, achieves a tradeoff between keeping the action cost and the expected state cost small. Note that the controller could have made the coin deterministic by setting u(Heads) = 0, but this is suboptimal because the associated action cost is too large. In general, the optimal actions resulting from our framework are stochastic. Fig. 1B is a shortestpath problem where q = 0 for the goal state (gray) and q = 1 for all other states. The passive dynamics correspond to the random walk on the graph. At the green state it does not matter which path is taken, so the optimal action equals the passive dynamics, the action cost is 0, and the costtogo (shown inside the circle) equals the length of the deterministic shortest path. At the red state, however, the optimal action deviates from the passive dynamics to cause a transition up, incurring an action cost of 0.6 and making the red state worse than the green state. In general, v(x) is smaller when the task can be accomplished in multiple ways starting from x. This reflects a preference for error tolerance that is inherent in our framework.
We now return to the theoretical development. The results take on a simpler form when expressed in terms of the desirability function This terminology reflects the fact that z is large at states where the costtogo v is small. Substituting Eq. 1 in Eq. 2, the Bellman equation can be written in terms of z as The expression being minimized resembles KL divergence between u and pz, except that pz is not normalized to sum to 1. Thus, to obtain proper KL divergence (SI Appendix), we have to multiply and divide by the normalization term Recall that KL divergence achieves its global minimum of zero when the 2 distributions are equal. Therefore, the optimal action u* is proportional to pz: This represents the first general class of MDPs where the optimal actions can be found analytically given the optimal coststogo. Previously, such results were available only in continuous domains.
The Bellman equation can now be simplified (SI Appendix) by substituting the optimal action, taking into account the normalization term and exponentiating. The result is The expectation G[z] is a linear operator; thus, Eq. 7 is linear in z. It can be written more compactly in vector notation. Enumerate the states from 1 to n, represent z(x) and q(x) with the ndimensional column vectors z and q, and p(x′x) with the nbyn matrix P, where the rowindex corresponds to x and the columnindex to x′. Then Eq. 7 becomes z = Mz, where M = diag(exp(−q))P, exp is applied elementwise and diag transforms vectors into diagonal matrices. The latter equation looks like an eigenvector problem, and indeed it can be solved (9) by using the power iteration method z ← Mz (which we call Z iteration). However, the problem here is actually simpler because the eigenvalue is 1 and v(x) = q(x) at terminal states. If we define the index sets T and N of terminal and nonterminal states and partition z, q, and P accordingly, Eq. 7 becomes The unknown z_{N} is the vector of desirabilities at the nonterminal states. It can be computed via matrix factorization or by using an iterative linear solver.
Let us now compare our result (Eq. 8) with policy iteration (3). We have to solve an equation of the form Az = b just once. In policy iteration one has to solve an equation of the form A_{(π)}v = b_{(π)} to evaluate the current policy π; then, the policy has to be improved and the process repeated. Therefore, solving an optimal control problem in our formulation is computationally equivalent to half a step of policy iteration.
Thus far, we have studied MDPs with discrete state spaces. There exists a family of continuous (in space and time) problems related to our MDPs. These problems have stochastic dynamics ω(t) is Brownian motion and σ is the noise amplitude. The cost rate is of the form The functions q(x), a(x), and B(x) can be arbitrary. This problem formulation is fairly general and standard (but see Discussion). Consider, for example, a onedimensional point with mass m, position x_{p}, and velocity x_{v}. Then, x = [x_{p},x_{v}]^{T}, a(x) = [x_{v},0]^{T}, and B = [0,m^{−1}]^{T}. The noise and control signals correspond to external forces applied to the point mass.
Unlike the discrete case where the agent could specify the transition probability distribution directly, here, u is a vector that can shift the distribution given by the passive dynamics but cannot reshape it. Specifically, if we discretize the time axis in Eq. 9 with step h, the passive dynamics are Gaussian with mean x + ha(x) and covariance hσ^{2}B(x)B(x)^{T}, whereas the controlled dynamics are Gaussian with mean x + ha(x) + hB(x)u and the same covariance. Thus, the agent in the continuous setting has less freedom compared with the discrete setting. Yet the two settings share many similarities, as follows. First, the KL divergence between the above Gaussians can be shown to be
The linear equations 7 and 12 were derived by minimizing total cost in the presence of terminal/goal states. Similar results can be obtained (SI Appendix) for all other performance criteria used in practice, in both discrete and continuous settings. They are summarized in Table 1. The constant c is the (unknown) average cost computed as the principal eigenvalue of the corresponding equation. z̃ is the exponent of the differential costtogo function (3) (SI Appendix). The constant α is the exponential discount factor. Note that all equations are linear except in the discounted case. The finitehorizon and totalcost results in continuous settings were previously known (6, 7); the remaining six equations were derived in our work.
Applications.
The first application is an algorithm for finding shortest paths in graphs (recall Fig. 1B). Let s(x) denote the length of the shortest path from state x to the nearest terminal state. The passive dynamics p correspond to the random walk on the graph. The state costs are q = ρ > 0 for nonterminal states and q = 0 for terminal states. Let v_{ρ}(x) denote the optimal costtogo function for given ρ. If the action costs were 0, the shortest path lengths would be
The second application is a method for continuous embedding of traditional MDPs with symbolic actions. It is reminiscent of linear programming relaxation in integer programming. Denote the symbolic actions in the traditional MDP with a, the transition probabilities with
The third application is a Monte Carlo method for learning z from experience in the absence of a model of the passive dynamics p and state costs q. Unlike the previous two applications, where we started with traditional MDPs and approximated them with MDPs in our class, here we assume that the problem is already in our class. The linear Bellman Eq. 7 can be unfolded recursively by replacing z(x′) with the expectation over the state following x′ and so on, and then pushing all state costs inside the expectation operators. This yields the pathintegral representation (x_{0},x_{1}⋯x_{tf}) are trajectories initialized at x_{0} = x and sampled from the passive dynamics, and t_{f} is the time when a goal state is first reached. A similar result holds in continuous settings, where the Feynman–Kac theorem states that the unique positive solution to Eq. 12 has a pathintegral representation. The use of Monte Carlo methods for solving continuous optimal control problems was pioneered in ref. 7. Our result (Eq. 16) makes it possible to apply such methods to discrete problems with nonGaussian noise.
The fourth application is related to the pathintegral approach above but aims to achieve faster convergence. It is motivated by Reinforcement Learning (2) where faster convergence is often observed by using Temporal Difference (TD) methods. A TDlike method for our MDPs can be obtained from Eq. 7. It constructs an approximation
The fifth application accelerates MDP approximations to continuous problems (16). Such MDPs are obtained via discretization and tend to be very large, calling for efficient numerical methods. Because continuous problems of the form (Eqs. 9 and 10) are limits of MDPs in our class, they can be approximated with MDPs in our class, which, in turn, are reduced to linear equations and solved efficiently. The same continuous problems can also be approximated with traditional MDPs and solved via dynamic programming. Both approximations converge to the same solution in the limit of infinitely fine discretization, and turn out to be equally accurate away from the limit, but our approximation is faster to compute. This is shown in Fig. 5 on a caronahill problem, where our method is ≈10 times faster than policy iteration and 100 times faster than value iteration.
The remaining applications have been developed recently. Below we summarize the key results and refer the reader to online preprints (12–14). The sixth application is a deterministic method for computing the most likely trajectory of the optimaly controlled stochastic system. Combining Eqs. 6 and 7, the optimal control law can be written as
The seventh application exploits the duality between Bayesian estimation and stochastic optimal control. This duality is wellknown for linear systems (4) and was recently generalized (8, 10) to nonlinear continuous systems of the form Eqs. 9 and 10. Duality also holds for our MDPs. Indeed, Eq. 18 can be interpreted as the posterior probability of a trajectory in a Bayesian smoothing problem, where p(x_{t+1}x_{t}) are the prior probabilities of the state transitions and exp(−q(x_{t})) are the likelihoods of some unspecified measurements. The desirability function in the control problem can be shown to be proportional to the backward filtering density in the dual estimation problem. Thus, stochastic optimal control problems can be solved by using Bayesian inference. See refs. 10 and 12 for details.
The eighth application concerns inverse optimal control, where the goal is to infer q(x) given a dataset {x_{k},x′_{k}}_{k=1⋯K} of state transitions generated by the optimal controller. Existing methods rely on guessing the cost function, solving the forward problem, comparing the solution with data, and improving the guess and iterating (18, 19). This indirect procedure can be inefficient when the forward problem is hard to solve. For our MDPs the inverse problem can be solved directly, by minimizing the convex function where v is the vector of (unknown) optimal coststogo, P is the passive dynamics matrix defined earlier, and d and c are the histograms of x′_{k} and x_{k} (i.e., d_{i} is the number of times that x′_{k} = i). It can be shown that L(v) is the negative loglikelihood of the dataset. We have found empirically that its Hessian tends to be diagonally dominant, which motivates an efficient quasiNewton method by using a diagonal approximation to the Hessian. Once the minimum is found, we can compute z and then find q from the linear Bellman equation. Fig. 5D illustrates this inverse optimal control method on the caronahill problem. See ref. 14 for details.
The ninth application is a method for constructing optimal control laws as combinations of certain primitives. This can be done in finitehorizon as well as totalcost problems with terminal/goal states, where the final cost plays the role of a boundary condition. Suppose we have a collection of component final costs g_{i}(x) for which we have somehow computed the desirability functions z_{i}(x). Linearity implies that, if the composite final cost g(x) satisfies
for some set of w_{i}, then the composite desirability function is
Discussion
We formulated the problem of stochastic optimal control in a way which is rather general and yet affords substantial simplifications. Exhaustive search over actions was replaced with an analytical solution and the computation of the optimal costtogo function was reduced to a linear problem. This gave rise to efficient new algorithms speeding up the construction of optimal control laws. Furthermore, our framework enabled a number of computations that were not possible previously: solving problems with nonquadratic final costs by mixing LQG primitives, finding the most likely trajectories of optimally controlled stochastic systems via deterministic methods, solving inverse optimal control problems via convex optimization, quantifying the benefits of error tolerance, and applying offpolicy learning in the state space as opposed to the stateaction space.
These advances were made possible by imposing a certain structure on the problem formulation. First, the control cost must be a KL divergence—which reduces to the familiar quadratic energy cost in continuous settings. This is a sensible way to measure control energy and is not particularly restrictive. Second, the controls and the noise must be able to cause the same state transitions; the analog in continuous settings is that both the controls and the noise act in the subspace spanned by the columns of B(x). This is a more significant limitation. It prevents us from modeling systems subject to disturbances outside the actuation space. We are now pursuing an extension that aims to relax this limitation while preserving many of the appealing properties of our framework. Third, the noise amplitude and the control costs are coupled, and, in particular, the control costs are large when the noise amplitude is small. This can be compensated to some extent by increasing the state costs while ensuring that exp(−q(x)) does not become numerically zero. Fourth, with regard to Z learning, following a policy other than the passive dynamics p requires importancesampling correction based on a model of p. Such a model could presumably be learned online; however, this extension remains to be developed.
This framework has many potential applications and we hope that the list of examples will grow as other researchers begin to use it. Our current focus is on highdimensional continuous problems such as those arising in biomechanics and robotics, where the discretetime continuousstate MDP approximation is particularly promising. It leads to linear integral equations rather than differential equations, resulting in robust numerical methods. Furthermore it is well suited to handle discontinuities due to rigidbody collisions. Initial results using mixtureofGaussian approximations to the desirability function are encouraging (11), yet a lot more work remains.
Acknowledgments
We thank Surya Ganguli, Javier Movellan, and Yuval Tassa for discussions and comments on the manuscript. This work was supported by the National Science Foundation.
Footnotes
 ^{1}Email: todorov{at}cs.washington.edu

Author contributions: E.T. designed research, performed research, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

See Commentary on Page 11429.

This article contains supporting information online at www.pnas.org/cgi/content/full/0710743106/DCSupplemental.

Freely available online through the PNAS open access option.
References
 ↵
 ↵
 Sutton R,
 Barto A
 ↵
 Bertsekas D
 ↵
 Stengel R
 ↵
 Korn R
 ↵
 Fleming W,
 Mitter S
 ↵
 ↵
 ↵
 Todorov E
 ↵
 Todorov E
 ↵
 Todorov E
 ↵
 Todorov E
 ↵
 Todorov E
 ↵
 Todorov E
 ↵
 Watkins C,
 Dayan P
 ↵
 Kushner H,
 Dupuis P
 ↵
 Bryson A,
 Ho Y
 ↵
 Ng A,
 Russell S
 ↵
Citation Manager Formats
More Articles of This Classification
Physical Sciences
Computer Sciences
Biological Sciences
Neuroscience
Related Content
Cited by...
 A spiking neural model of adaptive arm control
 Engagement of the Rat Hindlimb Motor Cortex across Natural Locomotor Behaviors
 The effect of model uncertainty on cooperation in sensorimotor interactions
 Thermodynamics as a theory of decisionmaking with informationprocessing costs
 How can we learn efficiently to act optimally and flexibly?