Gradient Temporal-Difference Learning

In the area of core reinforcement-learning algorithms, there has been continuing progress with our recently introduced gradient temporal-difference (GTD) learning algorithms. Temporal-difference (TD) learning is a core technology at the heart of modern reinforcement learning; a variant of TD learning is used in almost all large-scale applications of reinforcement learning. However, an important limitation of the standard algorithms is that they can only be used to learn about a way of behaving if the learning system commits to that way of behaving all the way to its completion. People, by contrast, are able to learn about a way of behaving even if they behave that way for only a limited time. We can start to do something (e.g., drive to work) and then abort it for any reason (e.g., we remember that it’s a holiday) yet still learn from the incomplete trial. Or we might start behaving in a way that is consistent with many different ways of continuing. Of course we can continue according to only one of them (we can only do one thing at a time!), but we might hope to learn about all of them from the initial portion that they share. This is called “off-policy learning” and has been part of the promise of TD learning for decades; with the GTD methods we introduced two years ago it can at last be done in a scalable and efficient way, resolving the most important open problem in reinforcement learning of the last 15 years.
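As a concrete illustration of off-policy gradient-TD prediction, the sketch below shows a two-timescale update of the TDC/GTD(0) kind with linear function approximation, in which an importance-sampling ratio reweights experience generated by one way of behaving toward the way of behaving we want to learn about. The feature vectors, step-sizes, and discount factor here are illustrative assumptions, not the exact form used in our implementations.

    import numpy as np

    def tdc_update(theta, w, phi, phi_next, reward, rho,
                   gamma=0.99, alpha=0.01, beta=0.1):
        """One off-policy gradient-TD (TDC/GTD(0)-style) update with linear features.

        theta    -- primary weights (the value-function parameters being learned)
        w        -- secondary weights (estimate of the expected TD error)
        phi      -- feature vector of the current state
        phi_next -- feature vector of the next state
        rho      -- importance-sampling ratio: target probability / behavior probability
        """
        # TD error under the current value estimate
        delta = reward + gamma * theta.dot(phi_next) - theta.dot(phi)
        # Primary weights: TD update plus a gradient-correction term, reweighted by rho
        theta = theta + alpha * rho * (delta * phi - gamma * phi_next * w.dot(phi))
        # Secondary weights: track the expected TD error on a faster timescale
        w = w + beta * rho * (delta - w.dot(phi)) * phi
        return theta, w

The secondary weight vector w is what distinguishes the gradient-TD family from conventional TD: it estimates the expected TD error and supplies the correction term that keeps the update stable under off-policy sampling with function approximation.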

In the RLAI Lab we have continued to make progress with off-policy learning by GTD algorithms. We have come to understand the algorithms better and have extended them to new settings. The most important extension has been from a prediction setting (learning the consequences of a given way of behaving) to a control setting (learning an optimal way of behaving). Recently we have introduced a very general and powerful GTD prediction algorithm called GQ(λ). We also proved that if the way of behaving is continually adjusted in a locally “greedy” fashion, then the resulting algorithm, called Greedy-GQ, converges to an optimal policy. Greedy-GQ is a direct improvement over Q-learning, the most popular off-policy TD control method, enabling off-policy learning on large applications for which Q-learning is inapplicable, inefficient, or unreliable.
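For concreteness, here is a minimal sketch of a Greedy-GQ-style update for linearly approximated action values, assuming state-action feature vectors and fixed step-sizes chosen only for illustration. The greedy maximization over next actions plays the role of the continually adjusted target policy; everything else mirrors the prediction-case update.

    import numpy as np

    def greedy_gq_update(theta, w, phi_sa, next_action_features, reward,
                         gamma=0.99, alpha=0.01, beta=0.1):
        """One Greedy-GQ-style update: a gradient-TD analogue of Q-learning.

        theta                -- action-value weights (linear approximation)
        w                    -- secondary weights, as in the prediction case
        phi_sa               -- features of the current state-action pair
        next_action_features -- feature vectors, one per action available next
        """
        # Greedy target: evaluate every next action and keep the best
        q_next = [theta.dot(phi) for phi in next_action_features]
        best = int(np.argmax(q_next))
        phi_best = next_action_features[best]

        # TD error toward the value of the greedy next action (as in Q-learning)
        delta = reward + gamma * q_next[best] - theta.dot(phi_sa)

        # Two-timescale, gradient-corrected updates
        theta = theta + alpha * (delta * phi_sa - gamma * phi_best * w.dot(phi_sa))
        w = w + beta * (delta - w.dot(phi_sa)) * phi_sa
        return theta, w

Compared with Q-learning, the structural additions are the secondary weights w and the gradient-correction term involving the greedy next action’s features; that correction is what permits stable off-policy learning with function approximation.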

The new gradient-TD algorithm has made it possible to learn about many different policies at the same time (see the Horde architecture). Several main questions remain to be answered: 1) What is the simplest method for learning the two step-sizes used in GTD? 2) Is there a simple way to reduce the variance of the GTD update? And, of course, a large number of experiments assessing the performance of Greedy-GQ on control problems would be very beneficial.
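To illustrate what learning about many policies in parallel can look like, the following sketch runs several independent gradient-TD predictors off a single stream of behavior, each reweighting the shared experience toward its own target policy. The class and function names are hypothetical, and the actual Horde architecture differs in its details.

    import numpy as np

    class OffPolicyPredictor:
        """One gradient-TD predictor: it learns the value of its own target
        policy from experience generated by a single behavior policy."""

        def __init__(self, n_features, target_policy, gamma=0.99, alpha=0.01, beta=0.1):
            self.theta = np.zeros(n_features)   # value-function weights
            self.w = np.zeros(n_features)       # secondary (correction) weights
            self.target_policy = target_policy  # maps (state, action) -> probability
            self.gamma, self.alpha, self.beta = gamma, alpha, beta

        def update(self, state, action, behavior_prob, phi, phi_next, reward):
            # Reweight the shared experience toward this predictor's target policy
            rho = self.target_policy(state, action) / behavior_prob
            delta = reward + self.gamma * self.theta.dot(phi_next) - self.theta.dot(phi)
            self.theta += self.alpha * rho * (delta * phi
                                              - self.gamma * phi_next * self.w.dot(phi))
            self.w += self.beta * rho * (delta - self.w.dot(phi)) * phi

    def learn_in_parallel(predictors, experience_stream):
        # Every step of behavior is handed to every predictor in turn
        for (state, action, behavior_prob, phi, phi_next, reward) in experience_stream:
            for p in predictors:
                p.update(state, action, behavior_prob, phi, phi_next, reward)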