# Fast reinforcement learning with generalized policy updates


Edited by David L. Donoho, Stanford University, Stanford, CA, and approved July 9, 2020 (received for review July 20, 2019)

## Abstract

The combination of reinforcement learning with deep learning is a promising approach to tackle important sequential decision-making problems that are currently intractable. One obstacle to overcome is the amount of data needed by learning systems of this type. In this article, we propose to address this issue through a divide-and-conquer approach. We argue that complex decision problems can be naturally decomposed into multiple tasks that unfold in sequence or in parallel. By associating each task with a reward function, this problem decomposition can be seamlessly accommodated within the standard reinforcement-learning formalism. The specific way we do so is through a generalization of two fundamental operations in reinforcement learning: policy improvement and policy evaluation. The generalized version of these operations allows one to leverage the solution of some tasks to speed up the solution of others. If the reward function of a task can be well approximated as a linear combination of the reward functions of tasks previously solved, we can reduce a reinforcement-learning problem to a simpler linear regression. When this is not the case, the agent can still exploit the task solutions by using them to interact with and learn about the environment. Both strategies considerably reduce the amount of data needed to solve a reinforcement-learning problem.

- artificial intelligence
- reinforcement learning
- generalized policy improvement
- generalized policy evaluation
- successor features

Reinforcement learning (RL) provides a conceptual framework to address a fundamental problem in artificial intelligence: the development of situated agents that learn how to behave while interacting with the environment (1). In RL, this problem is formulated as an agent-centric optimization in which the goal is to select actions to get as much reward as possible in the long run. Many problems of interest are covered by such an inclusive definition; not surprisingly, RL has been used to model robots (2) and animals (3), to simulate real (4) and artificial (5) limbs, to play board (6) and card (7) games, and to drive simulated bicycles (8) and radio-controlled helicopters (9), to cite just a few applications.

Recently, the combination of RL with deep learning added several impressive achievements to the above list (10–13). It does not seem too far-fetched, nowadays, to expect that these techniques will be part of the solution of important open problems. One obstacle to overcome on the way to making this possibility a reality is the enormous amount of data needed for an RL agent to learn to perform a task. To give an idea, in order to learn how to play one game of Atari, a simple video game console from the 1980s, an agent typically consumes an amount of data corresponding to several weeks of uninterrupted playing; it has been shown that, in some cases, humans are able to reach the same performance level in around 15 min (14).

One reason for such a discrepancy in the use of data is that, unlike humans, RL agents usually learn to perform a task essentially from scratch. This suggests that the range of problems our agents can tackle can be significantly extended if they are endowed with the appropriate mechanisms to leverage prior knowledge. In this article we describe one such mechanism. The framework we discuss is based on the premise that an RL problem can usually be decomposed into a multitude of “tasks.” The intuition is that, in complex problems, such tasks will naturally emerge as recurrent patterns in the agent–environment interaction. Tasks then lead to a useful form of prior knowledge: once the agent has solved a few, it should be able to reuse their solution to solve other tasks faster. Grasping an object, moving in a certain direction, and seeking some sensory stimulus are examples of naturally occurring behavioral building blocks that can be learned and reused.

Ideally, information should be exchanged across tasks whenever useful, preferably through a mechanism that integrates seamlessly into the standard RL formalism. The way this is achieved here is to have each task be defined by a reward function. As most animal trainers would probably attest, rewards are a natural mechanism to define tasks because they can induce very distinct behaviors under otherwise identical conditions (15, 16). They also provide a unifying language that allows each task to be cast as an RL problem, blurring the (somewhat arbitrary) distinction between the two (17). Under this view, a conventional RL problem is conceptually indistinguishable from a task self-imposed by the agent by “imagining” recompenses associated with arbitrary events. The problem then becomes a stream of tasks, unfolding in sequence or in parallel, whose solutions inform each other in some way.

What allows the solution of one task to speed up the solution of other tasks is a generalization of two fundamental operations underlying much of RL: policy evaluation and policy improvement. The generalized versions of these procedures, jointly referred to as "generalized policy updates," extend their standard counterparts from point-based to set-based operations. This makes it possible to reuse the solution of tasks in two distinct ways. When the reward function of a task can be reasonably approximated as a linear combination of other tasks' reward functions, the RL problem can be reduced to a simpler linear regression that is solvable with only a fraction of the data. When the linearity constraint is not satisfied, the agent can still leverage the task solutions, in this case by using them to interact with and learn about the environment. This can also considerably reduce the amount of data needed to solve the problem. Together, these two strategies give rise to a divide-and-conquer approach to RL that can potentially help scale our agents to problems that are currently intractable.

## RL

We consider the RL framework outlined in the Introduction: an agent interacts with an environment by selecting actions to get as much reward as possible in the long run (1). This interaction happens at discrete time steps, and, as usual, we assume it can be modeled as a Markov decision process (MDP) (18).

An MDP is a tuple $M \equiv (S, A, p, r, \gamma)$, where $S$ is the set of states, $A$ is the set of actions, $p(\cdot \mid s, a)$ gives the distribution of the next state after action $a$ is taken in state $s$, $r(s, a, s')$ is the reward associated with a transition, and $\gamma \in [0, 1)$ is a discount factor that makes rewards received further in the future less valuable. A policy $\pi$ maps states to actions, and its value function on task $r$ is the expected discounted sum of rewards obtained under it:

$$ q^{\pi}_{r}(s, a) \equiv E^{\pi}\left[ \sum_{i=0}^{\infty} \gamma^{i}\, r(S_{t+i}, A_{t+i}, S_{t+i+1}) \,\middle|\, S_t = s, A_t = a \right]. \quad \textbf{[1]} $$

A sample transition is a tuple $(s, a, r, s')$ indicating that the agent took action $a$ in state $s$ and ended up in state $s'$ while receiving reward $r$.

Given an MDP representing a task $r$, there exists at least one optimal policy $\pi^{*}_{r}$ that attains the maximum value $q^{*}_{r}(s, a) \equiv \max_{\pi} q^{\pi}_{r}(s, a)$ in every state-action pair; such a policy can be obtained by acting greedily with respect to $q^{*}_{r}$:

$$ \pi^{*}_{r}(s) \in \operatorname*{argmax}_{a}\, q^{*}_{r}(s, a). \quad \textbf{[2]} $$

### Policy Updates.

RL algorithms based on dynamic programming build on two fundamental operations (1).

#### Definition 1.

"Policy evaluation" is the computation of $q^{\pi}_{r}$, the value function of policy $\pi$ on task $r$.

#### Definition 2.

Given a policy $\pi$ and a task $r$, "policy improvement" is the definition of a policy $\pi'$ such that $q^{\pi'}_{r}(s, a) \geq q^{\pi}_{r}(s, a)$ for all $(s, a) \in S \times A$.

What makes policy evaluation tractable is a recursive relation between state-action values known as the Bellman equation:

$$ q^{\pi}_{r}(s, a) = E\left[ r(s, a, S') + \gamma\, q^{\pi}_{r}(S', \pi(S')) \right], \quad \textbf{[3]} $$

where $S' \sim p(\cdot \mid s, a)$. Exp. **3** induces a system of linear equations whose solution is $q^{\pi}_{r}$; it also allows $q^{\pi}_{r}$ to be estimated based on samples from the environment, as done by temporal-difference methods (22).

As for policy improvement, it is in fact simple to define a policy $\pi'$ satisfying Definition 2: one such policy takes, in each state, an action that maximizes the current value function,

$$ \pi'(s) \in \operatorname*{argmax}_{a}\, q^{\pi}_{r}(s, a). \quad \textbf{[4]} $$

The specific way policy updates are carried out gives rise to different dynamic programming algorithms. For example, value iteration and policy iteration can be seen as the extremes of a spectrum of algorithms defined by the extent of the policy evaluation step (19, 24). RL algorithms based on dynamic programming can be understood as stochastic approximations of these methods or other instantiations of policy updates (25).
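To make these operations concrete, here is a minimal tabular sketch of policy evaluation and policy improvement on a toy three-state MDP. The chain dynamics, rewards, and discount factor below are illustrative assumptions made for the example, not taken from the article:

```python
import numpy as np

# Toy MDP: 3 states, 2 actions. Action 0 stays put; action 1 advances along a cycle.
# Only state 2 is rewarding. All numbers here are assumptions for illustration.
n_states, n_actions, gamma = 3, 2, 0.9

P = np.zeros((n_actions, n_states, n_states))  # P[a, s, s'] = transition probability
P[0] = np.eye(n_states)                        # action 0: stay
P[1] = np.roll(np.eye(n_states), 1, axis=1)    # action 1: move to the next state
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.5]])  # R[s, a] = expected reward

def evaluate(policy):
    """Policy evaluation: solve the linear system induced by the Bellman equation."""
    P_pi = P[policy, np.arange(n_states)]  # transition matrix under the policy
    r_pi = R[np.arange(n_states), policy]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def improve(v):
    """Policy improvement: act greedily with respect to the current values."""
    q = R + gamma * np.einsum("ast,t->sa", P, v)  # q[s, a]
    return q.argmax(axis=1)

policy = np.zeros(n_states, dtype=int)  # start with "always stay"
for _ in range(10):                     # policy iteration: alternate the two updates
    new_policy = improve(evaluate(policy))
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
values = evaluate(policy)
```

Here policy evaluation is an exact linear solve of the Bellman system, and repeating the two updates implements policy iteration; truncating the evaluation step instead would move toward value iteration, illustrating the spectrum of algorithms mentioned above.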

## Generalized Policy Updates

From the discussion above, one can see that an important branch of the field of RL depends fundamentally on the notions of policy evaluation and policy improvement. We now discuss generalizations of these operations.

#### Definition 3.

“Generalized policy evaluation” (GPE) is the computation of the value function of a policy π on a set of tasks R.

#### Definition 4.

Given a set of policies $\Pi$ and a task $r$, "generalized policy improvement" (GPI) is the definition of a policy $\pi'$ such that

$$ q^{\pi'}_{r}(s, a) \geq \sup_{\pi \in \Pi} q^{\pi}_{r}(s, a) \text{ for all } (s, a) \in S \times A. \quad \textbf{[5]} $$

Obviously, in order for GPE and GPI to be useful in practice, we must have efficient ways of performing these operations. Consider GPE, for example. If we were to individually evaluate the policies $\pi \in \Pi$ on each task of interest, the cost of GPE would grow with the sizes of both $\Pi$ and $R$, quickly becoming prohibitive. We now describe an instantiation of GPE that sidesteps this issue for an important class of tasks.

### Fast GPE with Successor Features.

Conceptually, we can think of GPE as a function associated with a policy $\pi$ that takes a task $r$ as input and outputs a value function $q^{\pi}_{r}$. A fast instantiation of GPE is one that allows this function to be evaluated efficiently for any task in a set of interest.

Let $\phi(s, a, s') \in \mathbb{R}^{d}$ be a vector of features of the transition $(s, a, s')$, and let $\mathbf{w} \in \mathbb{R}^{d}$ be a vector of weights, where $^{\top}$ denotes the transpose of a vector. We will focus on the set of tasks whose rewards can be written as

$$ r(s, a, s') = \phi(s, a, s')^{\top} \mathbf{w}. \quad \textbf{[6]} $$

Following Barreto et al. (28), we define the "successor features" (SFs) of policy π as

$$ \psi^{\pi}(s, a) \equiv E^{\pi}\left[ \sum_{i=0}^{\infty} \gamma^{i}\, \phi_{t+i} \,\middle|\, S_t = s, A_t = a \right], $$

where $\phi_t = \phi(S_t, A_t, S_{t+1})$. The *i*th component of $\psi^{\pi}$ is the value function of π on the task defined by the *i*th component of ϕ (cf. Exp. **1**). As a consequence, SFs satisfy a Bellman equation analogous to Exp. **3**, which means that they can be computed using standard RL methods like temporal differences (22).

Given the SFs of a policy π, we can immediately evaluate π on any task expressible in the form of Exp. **6**:

$$ q^{\pi}_{r}(s, a) = \psi^{\pi}(s, a)^{\top} \mathbf{w}. \quad \textbf{[7]} $$

That is, on this class of tasks, GPE reduces to the inner product between the SFs and the weight vector $\mathbf{w}$ defining the task.
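As an illustration of how SFs enable fast GPE, the sketch below computes tabular SFs for a fixed policy on a toy cyclic environment and then evaluates that policy on new tasks with a single inner product. The environment, features, and policy are assumptions made for the example:

```python
import numpy as np

# Tabular SF sketch. Feature phi[s', i] fires when a transition lands in state s'
# of "type" i, so rewards of interest have the linear form r = phi^T w (Exp. 6).
n_states, gamma = 4, 0.9
phi = np.array([[1.0, 0.0],   # arriving at state 0 triggers feature 0
                [0.0, 1.0],   # arriving at state 1 triggers feature 1
                [0.0, 0.0],
                [0.0, 0.0]])  # d = 2 features

# A fixed policy induces a Markov chain; here it cycles 0 -> 1 -> 2 -> 3 -> 0.
P_pi = np.roll(np.eye(n_states), 1, axis=1)

# SFs satisfy a Bellman equation with phi in place of the reward, so the tabular
# solution mirrors policy evaluation: psi = (I - gamma P)^-1 (P phi), the expected
# discounted sum of features observed on arrival in each future state.
psi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, P_pi @ phi)

def value_on_task(w):
    """GPE via an inner product: the value of the policy on task w is psi^T w."""
    return psi @ w

v_task1 = value_on_task(np.array([1.0, 0.0]))   # reward only for feature 0
v_task2 = value_on_task(np.array([1.0, -1.0]))  # feature 0 good, feature 1 bad
```

Once `psi` has been computed, evaluating the policy on any new linear task costs one matrix-vector product, with no further interaction with the environment.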

The question then arises as to how inclusive the set of tasks representable as Exp. **6** is. If we have one feature per possible transition, acting as an indicator of its occurrence, then any reward function can be written in the form of Exp. **6**; at the other extreme, if the reward functions of the tasks of interest are themselves used as features, those tasks trivially satisfy Exp. **6**.

Although these limiting cases are reassuring, in practice we want the number of features $d$ to be small relative to the number of tasks of interest. In *Fast RL with GPE and GPI*, we give concrete examples of what the features ϕ may look like and also discuss possible ways of computing them directly from data. For now, it is worth mentioning that, even when ϕ does not span the set of tasks R, one can bound the error between the two sides of Exp. **7** based on how well the tasks of interest can be approximated using ϕ (proposition 1 in ref. 30).

### Fast GPI with Value-Based Action Selection.

Thanks to the concept of value function, standard policy improvement can be carried out efficiently through Exp. **4**. This raises the question of whether the same is possible with GPI. We now answer this question in the affirmative for the case in which GPI is applied over a finite set Π.

Following Barreto et al. (28), we propose to compute an improved policy as

$$ \pi'(s) \in \operatorname*{argmax}_{a}\, \max_{\pi \in \Pi}\, q^{\pi}_{r}(s, a). \quad \textbf{[8]} $$

The policy defined in Exp. **8** qualifies as a legitimate form of GPI as per Definition 4 (theorem 1 in ref. 28). Note that the action selected by a GPI policy in a given state may differ from the choices of every policy in Π in that state; this distinguishes Exp. **8** from methods that define a higher-level policy that alternates between the policies in Π. In *SI Appendix*, we show that the policy computed through Exp. **8** will in general outperform its analog defined over Π.

As discussed, GPI reduces to standard policy improvement when Π contains a single policy. Interestingly, this also happens with Exp. **8** when one of the policies in Π strictly outperforms the others on task r. This means that, after one application of GPI, adding the resulting policy $\pi'$ to Π reduces Exp. **8** to Exp. **4**. Therefore, if GPI is used in a sequence of policy updates, it will collapse back to its standard counterpart after the first update. In contrast with policy improvement, which is inherently iterative, GPI is a one-off operation that yields a quick solution for a task when multiple policies have been evaluated on it.

The guarantees regarding the performance of the GPI policy remain valid when Exp. **8** is used with approximate value functions. In this case, the lower bound in Exp. **5** decreases in proportion to the maximum approximation error in the value function approximations (theorem 1 in ref. 28). We refer the reader to *SI Appendix* for an extended discussion on GPI.
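A minimal sketch of GPE followed by GPI, assuming the SFs of a few policies are already available. Here the SFs are random stand-ins for learned quantities; the value of every policy on the current task is obtained by an inner product with w, and the action maximizes over policies and actions as in Exp. **8**:

```python
import numpy as np

# GPE + GPI sketch with precomputed SFs. Shapes and values are illustrative
# assumptions; psis[i, s, a, :] stands for the SFs of policy pi_i.
rng = np.random.default_rng(0)
n_states, n_actions, d, n_policies = 5, 3, 2, 2
psis = rng.random((n_policies, n_states, n_actions, d))

def gpi_action(s, w):
    """GPE: q[i, a] = psi_i(s, a)^T w for every policy i (Exp. 7).
    GPI: maximize over policies, then over actions (Exp. 8)."""
    q = psis[:, s] @ w                 # shape (n_policies, n_actions)
    return int(q.max(axis=0).argmax())

w = np.array([1.0, -0.5])              # the current task's preference vector
actions = [gpi_action(s, w) for s in range(n_states)]
```

Note that the chosen action maximizes the pointwise maximum over the policies' value functions, so in a given state it can differ from what every individual policy in the set would do.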

### The Generalized Policy.

Strictly speaking, in order to leverage the instantiations of GPE and GPI discussed in this article, we only need two elements: features $\phi$ and a set of SFs $\Psi \equiv \{\psi^{\pi_1}, \psi^{\pi_2}, \ldots, \psi^{\pi_n}\}$. Given a task $\mathbf{w}$, Exp. **7** immediately provides the value functions of the policies $\pi_i$, which Exp. **8** can be used to turn into a GPI policy, as explained in *Fast GPI with Value-Based Action Selection*.

Given a set of SFs Ψ, approximate or otherwise, any task $\mathbf{w}$ can be tackled through the combination of GPE (Exp. **7**) and GPI (Exp. **8**). We can thus think of generalized updates as implementing a policy $\pi(s; \mathbf{w})$ that is a function of both the state and the task.

Since each component of $\mathbf{w}$ weighs one of the features $\phi_i$, the vector $\mathbf{w}$ can be seen as encoding the agent's current preferences over the events the features represent. Changing $\mathbf{w}$ thus immediately changes the behavior induced by the generalized policy $\pi(s; \mathbf{w})$.

## Fast RL with GPE and GPI

We now describe how to build and use the adaptable policy $\pi(s; \mathbf{w})$ in practice. As a running example, we will use a simple environment in which the agent navigates a grid and can pick up objects of different types (a detailed description is given in *SI Appendix*).

This simple environment can be seen as a prototypical member of the class of problems in which GPE and GPI could be useful. This becomes clear if we think of objects as instances of (potentially abstract) concepts, here symbolized by their types, and note that the navigation dynamics are a proxy for any sort of dynamics that mediate the interaction of the agent with the world. In addition, despite its small size, the number of possible configurations of the grid is actually quite large.

By changing the rewards associated with each object type, one can create different tasks. We will consider that the agent wants to build a set of SFs Ψ that give rise to a generalized policy $\pi(s; \mathbf{w})$ that performs well over this entire family of tasks.

### Defining a Basis for Behavior.

In order to build the SFs Ψ, the agent must define two things: features ϕ and a set of policies Π. Since ϕ should be associated with rewarding events, we define each feature $\phi_i$ as an indicator of whether the agent has just picked up an object of the *i*th type.

Now that we have defined ϕ, we turn to the question of how to determine an appropriate set of policies Π. We will restrict the policies in Π to be solutions to tasks in the form of Exp. **6**; that is, each policy in Π is the solution of a task induced by some vector $\mathbf{w}$ specifying how desirable each object type is.

We are now ready to compute the SFs Ψ induced by our choices of ϕ and Π. In our experiments, we used an algorithm analogous to Q-learning to compute approximate SFs (details in *SI Appendix*). We represented the SFs using multilayer perceptrons with two hidden layers (33).
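The SF-learning step can be sketched in tabular form with a temporal-difference update analogous to Q-learning, as described above. The toy cyclic environment, features, step size, and number of steps below are illustrative assumptions; the article's experiments use multilayer perceptrons rather than tables:

```python
import numpy as np

# Tabular TD learning of SFs for a fixed policy, with phi in place of the reward.
# Environment, features, and hyperparameters are assumptions for illustration.
rng = np.random.default_rng(3)
n_states, n_actions, d, gamma, alpha = 4, 2, 2, 0.9, 0.1

def step(s, a):
    """Toy dynamics: action 0 stays, action 1 advances along a cycle."""
    s_next = s if a == 0 else (s + 1) % n_states
    phi = np.zeros(d)
    if s_next == 0:
        phi[0] = 1.0   # feature 0: "arrived at state 0"
    if s_next == 1:
        phi[1] = 1.0   # feature 1: "arrived at state 1"
    return s_next, phi

policy = [1, 1, 1, 1]                  # evaluate the "always advance" policy
psi = np.zeros((n_states, n_actions, d))

s = 0
for _ in range(20000):
    a = int(rng.integers(n_actions))   # random behavior to cover all pairs
    s_next, phi = step(s, a)
    a_next = policy[s_next]
    # TD update based on the Bellman equation that SFs satisfy.
    target = phi + gamma * psi[s_next, a_next]
    psi[s, a] += alpha * (target - psi[s, a])
    s = s_next
```

Because the SFs obey a Bellman equation, the update is literally Q-learning's with the scalar reward replaced by the feature vector; each component of `psi` converges to the value function of the evaluated policy on the task defined by the corresponding feature.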

The set of SFs Ψ, together with the features ϕ, gives the agent everything it needs to perform GPE and GPI, and thus to compute a generalized policy $\pi(s; \mathbf{w})$ for any task $\mathbf{w}$.

Results are shown in Fig. 5*A*. As a reference, we also show the learning curve of Q-learning (23) using the same architecture to directly approximate the value function of each task.

We evaluated the resulting generalized policy on a set of test tasks; results are shown in Fig. 5*A*. As expected, the generalization ability of GPE and GPI depends on the set of policies used. Perhaps more surprising is how well the generalized policy performs on tasks quite different from those solved by the policies in Π.

These experiments show that a proper choice of base policies Π can lead to good generalization over the entire set of tasks representable through ϕ. This is in line with the theory: the performance of the GPI policy on a task is lower-bounded in terms of how well that task can be approximated using the tasks solved by the policies in Π (cf. Exp. **5**). In addition to that, Barreto et al. have shown that it is possible to guarantee a minimum performance level for the generalized policy by appropriately choosing the policies in Π. Our results, shown in Fig. 5*B*, confirm the trend implied by the theory.

### Task Inference.

So far, we have assumed that the agent knows the vector w that describes the task of interest. Although this can be the case in some scenarios, ideally we would be able to apply GPE and GPI even when w is not provided. In this section and in *Preferences as Actions*, we describe two possible ways for the agent to learn w.

Given a task r, we are looking for a vector $\tilde{\mathbf{w}}$ that approximately satisfies Exp. **6**. Based on a set of sample transitions $(s_j, a_j, r_j, s'_j)$, we can infer $\mathbf{w}$ by solving the following minimization:

$$ \min_{\tilde{\mathbf{w}}} \sum_{j} \left( \phi(s_j, a_j, s'_j)^{\top} \tilde{\mathbf{w}} - r_j \right)^{2}. \quad \textbf{[9]} $$

Once we have $\tilde{\mathbf{w}}$, we can plug it into Exp. **7** and use GPE and GPI as we did before; that is, we have just turned an RL task into an easier linear regression problem.
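The minimization in Exp. **9** is an ordinary least-squares problem, so inferring w from transitions takes only a few lines. The synthetic features and rewards below stand in for real sample transitions:

```python
import numpy as np

# Task inference sketch (Exp. 9): regress observed rewards on transition features.
# The data here are synthetic stand-ins generated from an assumed true w.
rng = np.random.default_rng(1)
d, n_samples = 3, 200
w_true = np.array([1.0, -0.5, 0.0])

Phi = rng.random((n_samples, d))  # row j: features phi(s_j, a_j, s'_j)
r = Phi @ w_true + 0.01 * rng.standard_normal(n_samples)  # slightly noisy rewards

# Solving Exp. 9 is ordinary least squares: min_w ||Phi w - r||^2.
w_hat, *_ = np.linalg.lstsq(Phi, r, rcond=None)
```

With the estimate `w_hat` in hand, the agent can immediately act through GPE and GPI, with no further RL on the new task.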

To illustrate the potential of this approach, we revisited the tasks discussed in the previous section, but now inferring $\mathbf{w}$ by solving Exp. **9** with sample transitions (details in *SI Appendix*). Results are shown in Fig. 5 *A* and *B*. The performance of the generalized policy computed from the inferred $\tilde{\mathbf{w}}$ is very close to that of its counterpart using the true $\mathbf{w}$.

The problem described in Exp. **9** can be solved even if the rewards are the only information available about a given task. Other interesting possibilities arise when the observations provided by the environment can be used to infer $\mathbf{w}$.

So far, we have used handcrafted features ϕ. It turns out that Exp. **9** can be generalized to also allow ϕ to be inferred from data. Given sample transitions from k tasks, we can jointly learn approximations $\tilde{\phi}$ and $\tilde{\mathbf{w}}_1, \ldots, \tilde{\mathbf{w}}_k$ by solving

$$ \min_{\tilde{\phi}, \tilde{\mathbf{w}}_1, \ldots, \tilde{\mathbf{w}}_k} \sum_{i=1}^{k} \sum_{j} \left( \tilde{\phi}(s_{ij}, a_{ij}, s'_{ij})^{\top} \tilde{\mathbf{w}}_i - r_{ij} \right)^{2}, \quad \textbf{[10]} $$

where $(s_{ij}, a_{ij}, r_{ij}, s'_{ij})$ is the jth sample transition of the ith task. Exp. **10** can be solved as a multitask regression (34).
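One simple way to see why Exp. **10** is a multitask regression: if the same transitions are observed under all k tasks, stacking the rewards into a transition-by-task matrix reduces the problem to a low-rank factorization, sketched below with synthetic data. Only the product of features and task vectors is identified, so the learned features match the true ones only up to an invertible linear transformation:

```python
import numpy as np

# Sketch: learn features and task vectors jointly from rewards of k tasks.
# Synthetic data; R_mat[j, i] = reward of transition j on task i.
rng = np.random.default_rng(2)
n_transitions, d, k = 100, 2, 6

Phi_true = rng.random((n_transitions, d))
W_true = rng.random((d, k))
R_mat = Phi_true @ W_true

# A rank-d truncated SVD minimizes the squared reconstruction error, which is
# exactly the objective of Exp. 10 in this stacked, tabular form.
U, S, Vt = np.linalg.svd(R_mat, full_matrices=False)
Phi_hat = U[:, :d] * S[:d]   # learned transition features
W_hat = Vt[:d]               # learned task vectors, one column per task
reconstruction_error = np.abs(Phi_hat @ W_hat - R_mat).max()
```

In the article's setting ϕ is a parameterized function of the transition rather than a table, so Exp. **10** is optimized with gradient methods, but the identifiability caveat above carries over.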

In Fig. 5*B*, we show the performance of the generalized policy with features learned through Exp. **10**. In this experiment, we first computed approximate features $\tilde{\phi}$ by solving Exp. **10** and then solved Exp. **9** exactly as before, but now using $\tilde{\phi}$ in place of the handcrafted ϕ (details in *SI Appendix*). The results in Fig. 5*B* also show the effect of varying the dimension of the learned features on the performance of the generalized policy.

### Preferences as Actions.

In this section, we consider the scenario where, given features ϕ and a set of policies Π, there is no single vector $\mathbf{w}$ that leads to acceptable performance of the generalized policy $\pi(s; \mathbf{w})$. As an illustration, suppose the agent must pick up objects of whichever type is currently more abundant in the grid: since the desirability of each object type changes with the configuration of the environment, no fixed $\mathbf{w}$ captures the task.

One can represent this problem using the previously defined features ϕ by switching between two instantiations of $\mathbf{w}$: one that rewards the agent for picking up objects of the first type and one that rewards it for picking up objects of the second type, with the appropriate choice depending on the current contents of the grid.

We can look at the computation of $\mathbf{w}$ as a decision problem in its own right: at each moment, the agent selects a preference vector $\mathbf{w}$, and the generalized policy $\pi(s; \mathbf{w})$ turns this choice into behavior. In other words, preferences play the role of actions.

To illustrate the benefits of applying this transformation to the problem, we will use the task described above in which the agent must pick up the type of object that is more abundant. We will tackle this problem by reusing the SFs Ψ computed in the previous experiments: all that remains for the agent to learn is a policy over the preference vectors $\mathbf{w}$.

We compare the performance of an agent that learns over the space of preference vectors with that of Q-learning applied directly to the task; the former learns considerably faster.

The reason is that choosing among a small number of preference vectors is a much simpler problem than learning how to behave at the level of individual actions: the generalized policy $\pi(s; \mathbf{w})$ absorbs most of the complexity of the task.
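The transformation can be sketched as follows: the higher-level "action" is a choice of preference vector, and everything below that level is delegated to the generalized policy. The rule below is hand-coded for clarity, and the count-based state representation is an assumption of the example; in the experiments described above, this mapping would itself be learned:

```python
import numpy as np

# "Preferences as actions" sketch for the pick-up-the-more-abundant-type task.
# The higher-level decision is only which preference vector w to hand to the
# generalized policy pi(s; w); low-level navigation is already solved by it.
w_candidates = {
    "prefer type 1": np.array([1.0, 0.0]),
    "prefer type 2": np.array([0.0, 1.0]),
}

def choose_preference(counts):
    """Higher-level policy: map current object counts (an assumed state summary)
    to a preference vector. Hand-coded here; in general, learned with RL."""
    key = "prefer type 1" if counts[0] >= counts[1] else "prefer type 2"
    return w_candidates[key]

w_now = choose_preference((3, 7))    # type 2 is more abundant
w_later = choose_preference((5, 2))  # type 1 is more abundant
```

The action space at the higher level has size two, regardless of how large the underlying grid is, which is why learning over preferences can be so much faster than learning over primitive actions.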

As a final remark, we note that the ability to chain a sequence of preference vectors $\mathbf{w}$ changes the nature of the problem of computing an appropriate set of features ϕ: rather than having to represent the rewards of all tasks of interest, as in Exp. **6**, the features only need to support a set of behaviors that can be composed, over time, to solve these tasks.

### Lifelong Learning.

We have looked at the different subproblems involved in the use of GPE and GPI. For didactic purposes, we have discussed these problems in isolation, but it is instructive to consider how they can jointly emerge in a real application. One scenario where the problems above would naturally arise is "lifelong learning," in which the interaction of the agent with the environment unfolds over a long period (36). Since we expect some patterns to repeat during such an interaction, it is natural to break it into tasks that share some common structure. In terms of the concepts discussed in this article, this common structure would be ϕ, which can be handcrafted based on prior knowledge about the problem or learned through Exp. **10** at the beginning of the agent's lifetime. Based on the features ϕ, the agent would build a set of SFs Ψ, and thus a generalized policy $\pi(s; \mathbf{w})$, as it solves a sequence of tasks. When a new task arrives, the agent can then infer the corresponding $\mathbf{w}$ through Exp. **9**, which immediately yields a solution via GPE and GPI.

## Conclusion

We have generalized two fundamental operations in RL, policy improvement and policy evaluation, from single to multiple operands (tasks and policies, respectively). The resulting operations, GPE and GPI, can be used to speed up the solution of an RL problem. We showed possible ways to efficiently implement GPE and GPI and discussed how their combination leads to a generalized policy $\pi(s; \mathbf{w})$ whose behavior can be modulated through the preference vector $\mathbf{w}$.

Many extensions of the basic framework presented in this article are possible. As mentioned, Barreto et al. (31) have proposed a way of implementing a generalized policy in which the preference vectors $\mathbf{w}$ are computed by the agent itself. Other works have replaced the explicit maximization in Exp. **8** with learned approximations that generalize across policies (37) and have extended GPE and GPI to other settings (38, 39).

All of these extensions expand the framework built upon GPE and GPI. Together, they provide a conceptual toolbox that allows one to decompose an RL problem into tasks whose solutions inform each other. This results in a divide-and-conquer approach to RL, which, combined with deep learning, has the potential to scale up our agents to problems currently out of reach.

## Data Availability.

The source code used to generate all of the data associated with this article has been deposited in GitHub (https://github.com/deepmind/deepmind-research/tree/master/option_keyboard/gpe_gpi_experiments).

## Acknowledgments

We thank Joseph Modayil, Pablo Sprechmann, and the anonymous reviewer for their invaluable comments regarding this article.

## Footnotes

- ↵
^{1}To whom correspondence may be addressed. Email: andrebarreto@google.com.

Author contributions: A.B., D.B., D.S., and D.P. designed research; A.B., S.H., and D.B. performed research; A.B., S.H., and D.B. analyzed data; and A.B. wrote the paper.

The authors declare no competing interest.

This article is a PNAS Direct Submission.

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “The Science of Deep Learning,” held March 13–14, 2019, at the National Academy of Sciences in Washington, DC. NAS colloquia began in 1991 and have been published in PNAS since 1995. From February 2001 through May 2019 colloquia were supported by a generous gift from The Dame Jillian and Dr. Arthur M. Sackler Foundation for the Arts, Sciences, & Humanities, in memory of Dame Sackler’s husband, Arthur M. Sackler. The complete program and video recordings of most presentations are available on the NAS website at http://www.nasonline.org/science-of-deep-learning.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1907370117/-/DCSupplemental.

Published under the PNAS license.

## References

1. R. S. Sutton, A. G. Barto
2. J. Kober, J. A. Bagnell, J. Peters
3. W. Schultz, P. Dayan, P. R. Montague
4.
5. P. M. Pilarski et al.
6.
7. M. Bowling, N. Burch, M. Johanson, O. Tammelin
8. J. Randløv, P. Alstrøm
9. A. Y. Ng, H. J. Kim, M. I. Jordan, S. Sastry
10.
11.
12.
13. D. Silver et al.
14. P. Tsividis, T. Pouncy, J. Xu, J. Tenenbaum, S. Gershman
15. B. F. Skinner
16. C. L. Hull
17. A. Ng, S. Russell
18. M. L. Puterman
19. R. E. Bellman
20. D. P. Bertsekas, J. N. Tsitsiklis
21. R. Munos
22.
23.
24. R. Howard
25.
26. R. S. Sutton et al.
27. T. Schaul, D. Horgan, K. Gregor, D. Silver
28. A. Barreto et al.
29.
30. A. Barreto et al.
31. A. Barreto et al.
32.
33. I. Goodfellow, Y. Bengio, A. Courville
34.
35.
36. S. Thrun
37. D. Borsa et al.
38. J. Hunt, A. Barreto, T. Lillicrap, N. Heess
39. S. Hansen et al.
