{"id":146,"date":"2010-07-07T15:42:36","date_gmt":"2010-07-07T21:42:36","guid":{"rendered":"http:\/\/fs-s-wpmu-02.facsci.ualberta.ca\/rlai\/?page_id=146"},"modified":"2010-07-07T15:42:36","modified_gmt":"2010-07-07T21:42:36","slug":"rl-faq","status":"publish","type":"page","link":"https:\/\/spaces.facsci.ualberta.ca\/rlai\/resources\/rl-faq\/","title":{"rendered":"RL FAQ"},"content":{"rendered":"<p>Rich Sutton has prepared a FAQ about reinforcement learning &#8212; how to think\u00a0about it and how to do it successfully in practice.  This FAQ is an\u00a0attempt to pull together in one place some of the more common\u00a0questions and answers.<\/p>\n<p>&#8220;I have been free with my opinions, but I\u00a0would also welcome the opinions of others for inclusion here.  If you\u00a0have anything to add to my comments, including whole new questions\u00a0or totally different answers, please send me a note at\u00a0rich@richsutton.com.&#8221; &#8211; Rich Sutton<\/p>\n<p><a href=\"#General%20Questions\">General Questions<\/a><\/p>\n<ul>\n<li><a href=\"#What%20is%20RL\">What is Reinforcement Learning?<\/a><\/li>\n<li><a href=\"#just%20trial-and-error\">Is RL just trial-and-error learning, or does it include planning?<\/a><\/li>\n<li><a href=\"#RL%20and%20NDP\">How does RL relate to Neuro-Dynamic Programming?<\/a><\/li>\n<li><a href=\"#Operations%20Research\">What advantages does RL offer in Operations Research problems?<\/a><\/li>\n<li><a href=\"#brain\">How does RL relate to Neuroscience?<\/a><\/li>\n<li><a href=\"#animal%20behavior\">How does RL relate to the psychology of animal behavior?<\/a><\/li>\n<li><a href=\"#behaviorism\">How does RL relate to behaviorism?<\/a><\/li>\n<li><a href=\"#GAs\">How does RL relate to genetic algorithms?<\/a><\/li>\n<li><a href=\"#Who%20invented%20RL\">Who invented RL?<\/a><\/li>\n<li><a href=\"#applications\">What are some of the more significant applications of RL?<\/a><\/li>\n<li><a href=\"#Why don't you include genetic algorithms in 
RL\">Why don&#8217;t you include genetic algorithms in RL?<\/a><\/li>\n<\/ul>\n<p><a href=\"#studying%20and%20teaching%20RL\">Questions about studying and teaching RL<\/a><\/p>\n<ul>\n<li><a href=\"#What%20sources\">What sources do you recommend for an introduction to RL?<\/a><\/li>\n<li><a href=\"#demos of RL\">Where can I find some demos of RL?<\/a><\/li>\n<li><a href=\"#get started in RL research\">How can I get started in RL research?<\/a><\/li>\n<li><a href=\"#open problems\">What are some important open problems in RL?<\/a><\/li>\n<\/ul>\n<p><a href=\"#textbook\">Questions about the RL textbook by Sutton and Barto?<\/a><\/p>\n<ul>\n<li><a href=\"#online%20book\">Where can I find an online version of the Sutton and Barto textbook?<\/a><\/li>\n<li><a href=\"#postscript%20book\">Where can I find an electronic version of the book suitable for printing?<\/a><\/li>\n<li><a href=\"#answers\">Where can I find the answers to the exercises?<\/a><\/li>\n<li><a href=\"#error%20in%20the%20book\">I have found an apparent error in the book.  
What should I do about it?<\/a><\/li>\n<li><a href=\"#evaluation%20copy\">How can I obtain an evaluation copy of the book?<\/a><\/li>\n<li><a href=\"#teaching materials\">Where can I obtain other materials for teaching RL?<\/a><\/li>\n<\/ul>\n<p><a href=\"#NutsBolts\">Nuts and Bolts of RL<\/a><\/p>\n<ul>\n<li><a href=\"#huge%20spaces\">My state and\/or action space is huge!  Can I still apply RL?<\/a><\/li>\n<li><a href=\"#continuous%20actions\">Most RL work assumes the action space is discrete; what about continuous actions?<\/a><\/li>\n<li><a href=\"#continuous%20time\">What about continuous time?<\/a><\/li>\n<li><a href=\"#curse%20of%20dimensionality\">What is the curse of dimensionality?<\/a><\/li>\n<li><a href=\"#tile-coding%20just%20grids\">Isn&#8217;t tile-coding just grids, and thus subject to the curse of dimensionality?<\/a><\/li>\n<li><a href=\"#CMACs\">Why do you call it &#8220;tile coding&#8221; instead of &#8220;CMACs&#8221;?<\/a><\/li>\n<li><a href=\"#Are%20RL%20methods%20stable%20with%20function%20approximation\">Are RL methods stable with function approximation?<\/a><\/li>\n<li><a href=\"#tile-coding choices\">In tile coding, how do you pick the state variables to use,<br \/>\nwhat groups of them to treat conjunctively, how many tilings, the<br \/>\ntile widths, etc?  
Do we have to make all these kinds of representational<br \/>\ndecisions ourselves, as system designers, or are there ways of<br \/>\nautomating these decisions?<\/a><\/li>\n<li><a href=\"#actor-critic\">What are <em>actor-critic<\/em> and <em>policy-gradient<\/em> RL systems?<\/a><\/li>\n<li><a href=\"#backpropagation through time\">What about backpropagation through time?<\/a><\/li>\n<\/ul>\n<p><a href=\"#Advice%20and%20Opinions\">Advice and Opinions<\/a><\/p>\n<ul>\n<li><a href=\"#backpropagation\">I am doing RL with a backpropagation neural network and it doesn&#8217;t work; what should I do?<\/a><\/li>\n<li><a href=\"#function approximator\">What kind of function approximator do you recommend?<\/a><\/li>\n<li><a href=\"#best%20algorithm\">What is the best algorithm to use?<\/a><\/li>\n<li><a href=\"#why linear\">How can you recommend linear function approximation when we all know<br \/>\nit is limited?  Why not use the powerful nonlinear function<br \/>\napproximators such as neural networks, decision trees, and so forth<br \/>\nthat are so popular in supervised learning?<\/a><\/li>\n<li><a href=\"#Why not instance-based\">What about instance-based function approximation methods such as nearest neighbor and local linear regression?  Why don&#8217;t you recommend them?<\/a><\/li>\n<li><a href=\"#Q-value\">Why do you disparage the term Q-value?<\/a><\/li>\n<\/ul>\n<p><a href=\"#Miscellaneous\">Miscellaneous Questions<\/a><\/p>\n<ul>\n<li><a href=\"#project%20help\">I am working on a project due soon. Can you help me?<\/a><\/li>\n<li><a href=\"#code%20trouble\">I am having trouble with your code.  
Can you help me get it to work?<\/a><\/li>\n<\/ul>\n<hr \/>\n<p><a name=\"General Questions\"><\/a><\/p>\n<p><strong>General Questions<\/strong><\/p>\n<hr \/>\n<p><a name=\"What is RL\"><\/a><\/p>\n<p><strong>What is Reinforcement Learning?<\/strong><\/p>\n<p style=\"text-align: left\">Reinforcement learning (RL) is learning from interaction with an\u00a0environment, from the consequences of action, rather than\u00a0from explicit teaching.  RL became popular in the 1990s within machine learning and artificial intelligence, but also within operations research and with offshoots in psychology and neuroscience.<\/p>\n<p>Most RL research is conducted within the mathematical framework of\u00a0Markov decision processes (MDPs).  MDPs involve a decision-making agent interacting with its environment so as to maximize the cumulative reward it receives over time. The agent perceives aspects of the environment&#8217;s state and selects actions.  The agent may estimate a value function and use it to construct better and better decision-making policies over time.<\/p>\n<p>RL algorithms are methods for solving this kind of problem, that is, problems involving sequences of decisions in which each decision affects what opportunities are available later, in which the effects need not be\u00a0deterministic, and in which there are long-term goals. RL methods are intended to address the kind of learning and decision-making problems that people and animals face in their normal, everyday lives.<\/p>\n<p>For more information, see the <a href=\"#What%20sources\">sources recommended for an introduction to RL<\/a>.<\/p>\n<hr \/>\n<p><a name=\"just trial-and-error\"><\/a><\/p>\n<p><strong>Is RL just trial-and-error learning, or does it include planning?<\/strong><\/p>\n<p>Modern reinforcement learning concerns both trial-and-error learning\u00a0without a model of the environment, and deliberative planning with a\u00a0model. 
By &#8220;a model&#8221; here we mean a model of the dynamics of the\u00a0environment. In the simplest case, this means just an estimate of the\u00a0state-transition probabilities and expected immediate rewards of the\u00a0environment.  In general it means any predictions about the\u00a0environment&#8217;s future behavior conditional on the agent&#8217;s behavior.<\/p>\n<hr \/>\n<p><a name=\"RL and NDP\"><\/a><\/p>\n<p><strong>How does RL relate to Neuro-Dynamic Programming?<\/strong><\/p>\n<p>To a first approximation, Reinforcement Learning and Neuro-Dynamic\u00a0Programming are synonymous.<br \/>\nThe name &#8220;reinforcement learning&#8221; came from psychology (although\u00a0psychologists rarely use exactly this term) and dates back to the early days of cybernetics.  For example, Marvin Minsky used this term in his 1954 thesis, and Barto and Sutton revived it in the early 1980&#8217;s.  The name &#8220;neuro-dynamic programming&#8221; was coined by Bertsekas and Tsitsiklis in 1996 to capture the idea of the field as a combination of neural networks and dynamic programming.<\/p>\n<p>In fact, neither name is very descriptive of the subject, and I\u00a0recommend you use neither when you want to be technically precise.\u00a0Names such as these are useful when referring to a general<br \/>\nbody of research, but not for carefully distinguishing ideas\u00a0from one another.  In that sense, there is no point in trying\u00a0to draw a careful distinction between the referents of these two names.<\/p>\n<p>The problem with &#8220;reinforcement learning&#8221; is that it is dated.  
Much\u00a0of the field does not concern learning at all, but just planning\u00a0from complete knowledge of the problem (a model of the environment).\u00a0Almost the same methods are used for planning and learning, and it seems\u00a0off-point to emphasize learning in the name of the field.\u00a0&#8220;Reinforcement&#8221; does not seem particularly pertinent either.<\/p>\n<p>The problem with &#8220;neuro-dynamic programming&#8221; is similar in that neither\u00a0neural networks nor dynamic programming are critical to\u00a0the field.  Many of the methods, such as &#8220;Monte Carlo&#8221; or &#8220;rollout&#8221; methods,\u00a0are completely unrelated to dynamic programming, and neural networks\u00a0are just one choice among many for the method of function approximation.\u00a0Moreover, one could argue that the component names, &#8220;neural networks&#8221;\u00a0and &#8220;dynamic programming&#8221;, are each not very descriptive of their respective\u00a0methods.<\/p>\n<hr \/>\n<p><a name=\"Operations Research\"><\/a><\/p>\n<p><strong>What advantages does RL offer in Operations Research problems?<\/strong><\/p>\n<p>Using function approximation, RL can apply to much larger\u00a0state spaces than classical sequential optimization techniques such\u00a0as dynamic programming.  In addition, using simulations (sampling), RL\u00a0can apply to systems that are too large or complicated to explicitly\u00a0enumerate the next-state transition probabilities.<\/p>\n<hr \/>\n<p><a name=\"brain\"><\/a><\/p>\n<p><strong>How does RL relate to Neuroscience?<\/strong><\/p>\n<p>Ideally, the ideas of reinforcement learning could constitute part of a\u00a0<em>computational theory<\/em> of what the brain is doing and why.  
A number\u00a0of links have been drawn between reinforcement learning and neuroscience,\u00a0beginning with early models of classical conditioning based on\u00a0temporal-difference learning (see\u00a0<a href=\"publications.html#sutton-barto-82\">Barto and Sutton, 1982<\/a>;\u00a0<a href=\"publications.html#sutton-barto-81-PsychRev\">Sutton and Barto, 1981<\/a>,\u00a0<a href=\"publications.html#sutton-barto-90\">1990<\/a>;\u00a0<a href=\"publications.html#Moore-et-al\">Moore et al., 1986<\/a>),<br \/>\nand continuing through work on foraging\u00a0and prediction learning (see\u00a0<a href=\"http:\/\/www.gatsby.ucl.ac.uk\/%7Edayan\/papers\/mdps95.html\">Montague et al., 1995<\/a>,\u00a0<a href=\"http:\/\/www.gatsby.ucl.ac.uk\/%7Edayan\/papers\/mds96.html\">1996<\/a>),\u00a0and on dopamine neurons in monkeys as a temporal-difference-error\u00a0distribution system.  A good\u00a0<a href=\"http:\/\/www.gatsby.ucl.ac.uk\/%7Edayan\/papers\/sdm97.html\">survey paper<\/a> is available.\u00a0See also <a href=\"http:\/\/www.sloan.salk.edu\/%7Esuri\/#td\">Suri, submitted<\/a>. \u00a0A\u00a0<a href=\"http:\/\/www.amazon.com\/exec\/obidos\/ASIN\/0444819312\/qid=996981931\/sr=1-4\/ref=sc_b_4\/103-8914416-7879060\">book<\/a> collects a number of\u00a0relevant papers.  Doya has extensively developed\u00a0<a href=\"http:\/\/www.isd.atr.co.jp\/%7Edoya\/papers_new01.html#basalganglia\">RL models of the basal ganglia<\/a>.\u00a0Many of these areas are very active at present and changing rapidly.<\/p>\n<hr \/>\n<p><a name=\"animal behavior\"><\/a><\/p>\n<p><strong>How does RL relate to the psychology of animal behavior?<\/strong><\/p>\n<p>Broadly speaking, RL works as a pretty good model of instrumental learning,\u00a0though a detailed argument for this has never been publicly made (the closest to\u00a0this is probably <a href=\"publications.html#BSW\">Barto, Sutton and Watkins, 1990<\/a>). 
\u00a0On the other hand, the links between classical (or Pavlovian) conditioning and\u00a0temporal-difference (TD) learning (one of the central elements of RL) are close and\u00a0widely acknowledged (see <a href=\"publications.html#sutton-barto-90\">Sutton and Barto, 1990<\/a>). \u00a0<a href=\"http:\/\/www.cecs.missouri.edu\/%7Ersun\">Ron Sun<\/a> has developed hybrid models combining high-level and low-level\u00a0skill learning, based in part on RL, which make contact with\u00a0psychological data (see \u00a0<a href=\"http:\/\/www.cecs.missouri.edu\/%7Ersun\/sun.cs01.pdf\">Sun, Merrill, and Peterson, 2001<\/a>).<\/p>\n<hr \/>\n<p><a name=\"behaviorism\"><\/a><\/p>\n<p><strong>How does RL relate to behaviorism?<\/strong><\/p>\n<p>Formally, RL is unrelated to behaviorism, or at least to the aspects\u00a0of behaviorism that are widely viewed as undesirable.  Behaviorism\u00a0has been disparaged for focusing exclusively on behavior, refusing to consider\u00a0what was going on inside the head of the subject.  RL of course is all about the\u00a0algorithms and processes going on inside the agent.  For example, we often consider<br \/>\nthe construction of internal models of the environment within the agent, which is\u00a0far outside the scope of behaviorism.<\/p>\n<p>Nevertheless, RL shares with behaviorism its origins in animal learning theory, and\u00a0in its focus on the interface with the environment.  RL&#8217;s states and actions are\u00a0essentially animal learning theory&#8217;s stimuli and responses.  Part of RL&#8217;s point is\u00a0that these are the essential common final path for all that goes on in the agent&#8217;s\u00a0head.  
In the end it all comes down to the actions taken and the states perceived.<\/p>\n<hr \/>\n<p><a name=\"applications\"><\/a><\/p>\n<p><strong>What are some of the more significant applications of RL?<\/strong><\/p>\n<hr \/>\n<p><a name=\"GAs\"><\/a><\/p>\n<p><strong>How does RL relate to genetic algorithms?<\/strong><\/p>\n<p>Most work with genetic algorithms simulates evolution, not learning during\u00a0an individual&#8217;s life, and because of this is very different from work in RL.  That having\u00a0been said, there are two provisos.  First, there is a large body of work on\u00a0classifier systems that uses or is closely related to genetic algorithms.\u00a0This work is concerned with learning during a single agent&#8217;s lifetime\u00a0(using GAs to organize the components of the agent&#8217;s mind) and is in fact\u00a0RL research.  The second proviso is that GA work is\u00a0often related to RL by virtue of being applied to the same <em>problems<\/em>.\u00a0For example, GA methods can be applied to <em>evolve<\/em> a backgammon player\u00a0and that player can be compared with a player learned by RL methods.  In\u00a0fact, a large portion of systems evolved by GAs are controllers that could\u00a0alternatively be learned by RL methods.  It is tempting here to make a\u00a0blanket statement about which class of methods is more appropriate or\u00a0performs better.  A crucial distinction is that between problems in\u00a0which state information is available and those in which it is not.  In\u00a0backgammon, for example, the state of the board is available.  In problems like these RL methods seem to\u00a0have a distinct advantage over evolutionary methods such as GAs.  On other\u00a0problems there is little state information.  Suppose you wish to learn a\u00a0golf swing, or some other ballistic motion that is generally over before\u00a0useful sensory feedback is available.  
Then the system is far from Markov and\u00a0there is little or no advantage provided by addressing the state\u00a0information during a single swing.  In these cases RL methods would not be\u00a0expected to have an advantage over evolutionary methods.  In fact, if they\u00a0insisted on trying to treat sensory information as state, as in using\u00a0temporal-difference methods, they may well do worse.<\/p>\n<hr \/>\n<p><a name=\"Why don't you include genetic algorithms in RL\"><\/a><\/p>\n<p><strong>Why don&#8217;t you include genetic algorithms in RL?<\/strong><\/p>\n<hr \/>\n<p><a name=\"Who invented RL\"><\/a><\/p>\n<p><strong>Who invented RL?<\/strong><\/p>\n<p>There is plenty of blame here to go around.\u00a0Minsky, Bellman, Howard, Andreae, Werbos, Barto, Sutton, Watkins&#8230; All played\u00a0important or early roles.  See the\u00a0<a href=\"book\/1\/node7.html#SECTION00160000000000000000\">history from the Sutton and Barto text<\/a>.<\/p>\n<hr \/>\n<p><a name=\"studying and teaching RL\"><\/a><\/p>\n<p><strong>Questions about studying and teaching RL<\/strong><\/p>\n<hr \/>\n<p><a name=\"What sources\"><\/a><\/p>\n<p><strong>What sources do you recommend for an introduction to reinforcement learning?<\/strong><\/p>\n<p>For a general introduction, I recommend my book with Prof. Barto:\u00a0Reinforcement Learning: An Introduction,\u00a0by Richard S. Sutton and Andrew G. Barto.  MIT Press, 1998.\u00a0<a href=\"book\/the-book.html\">Online version<\/a>.\u00a0There is also a Japanese translation available. \u00a0For a more formal treatment, including rigorous proofs, I recommend the\u00a0text by Bertsekas and Tsitsiklis:\u00a0<a href=\"http:\/\/www.amazon.com\/exec\/obidos\/ASIN\/1886529108\/ref=sim_books\/002-2252285-1455243\">Neuro-dynamic Programming<\/a>,\u00a0by Dimitri P. Bertsekas and John N. Tsitsiklis.\u00a0Athena Scientific, 1996. 
\u00a0If you don&#8217;t have time for a textbook-length treatment, your best\u00a0bet is one or both of these two papers:\u00a0<a href=\"http:\/\/www.jair.org\/abstracts\/kaelbling96a.html\">Reinforcement learning: A survey<\/a>, by Kaelbling, L.P., Littman, M.L.,\u00a0and Moore, A.W., in the <em>Journal of Artificial Intelligence Research<\/em>, 4:237&#8211;285, 1996. \u00a0<a href=\"publications.html#BSW\">Learning and sequential decision making<\/a>, by\u00a0Barto, A.G., Sutton, R.S., &amp; Watkins, C.J.C.H.,\u00a0in <em>Learning and Computational Neuroscience<\/em>,\u00a0M. Gabriel and J.W. Moore (Eds.), pp. 539&#8211;602, 1990, MIT Press.<\/p>\n<hr \/>\n<p><a name=\"textbook\"><\/a><\/p>\n<p><strong>Questions about the RL textbook by Sutton and Barto<\/strong><\/p>\n<hr \/>\n<p><a name=\"online book\"><\/a><\/p>\n<p><strong>Where can I find an online version of the Sutton and Barto book?<\/strong><\/p>\n<p><a href=\"book\/the-book.html\">http:\/\/richsutton.com\/book\/the-book.html<\/a><\/p>\n<p>There is also a lot of other material there: code, figures, errata, some teaching materials.<\/p>\n<hr \/>\n<p><a name=\"postscript book\"><\/a><\/p>\n<p><strong>Where can I find an electronic version of the book suitable for printing?<\/strong><\/p>\n<p>PDFs of pre-publication versions of Chapters 1, 3, and 9 (only) are available at\u00a0<a href=\"book\/the-book.html\">http:\/\/richsutton.com\/book\/the-book.html<\/a><\/p>\n<hr \/>\n<p><a name=\"answers\"><\/a><\/p>\n<p><strong>Where can I find the answers to the exercises?<\/strong><\/p>\n<p>An instructor&#8217;s manual containing answers to all the non-programming\u00a0exercises is available to qualified teachers.  Readers using the book for self-study can obtain answers on a\u00a0chapter-by-chapter basis after working on the exercises themselves. 
\u00a0<a href=\"book\/solutions.html\">For details see here<\/a>.<\/p>\n<hr \/>\n<p><a name=\"error in the book\"><\/a><\/p>\n<p><strong>I have found an apparent error in the book.  What should I do about it?<\/strong><\/p>\n<p>First, please check the <a href=\"book\/errata.html\">list of errata<\/a>.  If you don&#8217;t find your error\u00a0there, then please send me a note describing it at rich@richsutton.com. \u00a0If I agree that it is an\u00a0error, then I will add it to the errata list.  For your troubles you will\u00a0be credited in the list for the discovery of the error.<\/p>\n<hr \/>\n<p><a name=\"demos of RL\"><\/a><\/p>\n<p><strong>Where can I find some demos of RL?<\/strong><\/p>\n<hr \/>\n<p><a name=\"open problems\"><\/a><\/p>\n<p><strong>What are some important open problems in RL?<\/strong><\/p>\n<hr \/>\n<p><a name=\"evaluation copy\"><\/a><\/p>\n<p><strong>How can I obtain an evaluation copy of the book?<\/strong><\/p>\n<p>Qualified instructors can generally obtain an examination copy from\u00a0MIT Press.  Please see the \u00a0<a href=\"http:\/\/mitpress.mit.edu\/book-home.tcl?isbn=0262193981\">MIT Press home page for the book<\/a>.<\/p>\n<hr \/>\n<p><a name=\"teaching materials\"><\/a><\/p>\n<p><strong>Where can I obtain other materials for teaching RL?<\/strong><\/p>\n<hr \/>\n<p><a name=\"get started in RL research\"><\/a><\/p>\n<p><strong>How can I get started in reinforcement learning research?<\/strong><\/p>\n<hr \/>\n<p><a name=\"NutsBolts\"><\/a><\/p>\n<p><strong>Nuts and Bolts of RL<\/strong><\/p>\n<hr \/>\n<p><a name=\"huge spaces\"><\/a><\/p>\n<h3>My state and\/or action space is huge!  Can I still apply RL?<\/h3>\n<p>Yes, but you can&#8217;t get by with simple tables; you\u00a0will need some kind of function approximation.<\/p>\n<p>&#8220;Function approximation&#8221; refers to the use of a parameterized functional\u00a0form to represent the value function (and\/or the policy), as opposed to a\u00a0simple table.  
A table is able to represent the value of each state\u00a0separately, without confusion, interaction, or generalization with the\u00a0value of any other state.  In typical problems, however, there are far too\u00a0many states to learn or represent their values individually; instead we\u00a0have to generalize from observed states to new, unobserved ones.  In\u00a0principle, this need not be a problem.  There are a host of supervised\u00a0learning methods that can be used to approximate functions.  However,\u00a0there are both theoretical and practical pitfalls, and some care is needed.\u00a0See Chapter 8 of the Sutton and Barto text.\u00a0For the most part, the theoretical foundation that RL adopts from dynamic\u00a0programming is no longer valid in the case of function approximation.  For\u00a0example, Q-learning with linear function approximation is known to be\u00a0unsound.  The strongest positive result is for on-policy prediction with\u00a0linear function approximators (<a href=\"http:\/\/web.mit.edu\/jnt\/www\/Papers\/td.ps\">Tsitsiklis and Van Roy, 1997<\/a>; Tadic, 2001).\u00a0This is an area of active current research (e.g., see\u00a0<a href=\"http:\/\/nips.djvuzone.org\/djvu\/nips13\/Gordon.djvu\">Gordon, 2001<\/a>;\u00a0<a href=\"publications.html#offpolicy\">Precup, Sutton &amp; Dasgupta, 2001<\/a>).<\/p>\n<hr \/>\n<p><a name=\"continuous actions\"><\/a><\/p>\n<h3>Most RL work assumes the action space is discrete; what about continuous actions?<\/h3>\n<p>It is true that most RL work has considered discrete action spaces, but this was\u00a0usually done for convenience, not as an essential limitation of the ideas; and\u00a0there are exceptions.  Nevertheless, it is often not obvious how to extend RL\u00a0methods to continuous, or even large discrete, action spaces.  The key problem is\u00a0that RL methods typically involve a max or sum over elements of the action space,\u00a0which is not feasible if the space is large or infinite.  
The natural approach is to\u00a0replace the enumeration of actions with a sample of them, and average (just as we\u00a0replace the enumeration of possible next states with a sample of them, and average).\u00a0This requires either a very special structure for the action-value function, or else\u00a0a stored representation of the best known policy.  Actor-critic methods are one\u00a0approach.<\/p>\n<p>With no attempt to be exhaustive, some of the earlier\u00a0RL research with continuous actions includes:<\/p>\n<ul>\n<li>Williams, R.J. (1992).<br \/>\nSimple statistical gradient-following algorithms for connectionist\u00a0reinforcement learning.  <em>Machine Learning<\/em>, 8:229&#8211;256.<\/li>\n<li>Millington, P.J. (1991).<br \/>\nAssociative Reinforcement Learning for Optimal Control,\u00a0M.S. Thesis, Massachusetts Institute of Technology,\u00a0Technical Report CSDL-T-1070.<\/li>\n<li>Baird, L. C., &amp; Klopf, A. H. (1993).<br \/>\n<a href=\"http:\/\/www.leemon.com\/papers\/wirefit\/wirefit.html#RTFToC1\">Reinforcement learning with high-dimensional, continuous actions<\/a>.<br \/>\nTechnical Report WL-TR-93-1147.\u00a0Wright-Patterson Air Force Base Ohio: Wright Laboratory.<\/li>\n<li>Santamaria, J.C., Sutton, R.S., Ram, A. (1998).<br \/>\n<a href=\"publications.html#santamaria\">Experiments with reinforcement learning in problems with<br \/>\ncontinuous state and action spaces<\/a>, Adaptive Behavior 6(2):163-218.<\/li>\n<li>Yamada, S., Nakashima, M., and Shiono, S. (1998).<br \/>\nReinforcement learning to train a cooperative network with both\u00a0discrete and continuous output neurons,<br \/>\nIEEE Trans. 
on Neural Networks 9(6):1502-1508.<\/li>\n<\/ul>\n<p>See also:<\/p>\n<ul>\n<li><a href=\"http:\/\/www.dai.ed.ac.uk\/homes\/andys\/PAGE_PAPERS\/papers.html\">The papers and thesis of Andrew Smith<\/a><\/li>\n<\/ul>\n<p>I would be glad to include new or other work in this list as well.\u00a0Please send me pointers!<\/p>\n<hr \/>\n<p><a name=\"actor-critic\"><\/a><\/p>\n<h3>What are <em>actor-critic<\/em> and <em>policy-gradient<\/em> RL systems?<\/h3>\n<p>These are systems that explicitly store both an approximate value\u00a0function (critic) and an approximate policy (actor).  These were\u00a0popular in the early &#8217;80s, then became less important with the\u00a0development of methods such as Q-learning and Sarsa that explicitly\u00a0represent only an action-value function, and compute their policy\u00a0from the value function.  Recently interest in actor-critic methods\u00a0has revived because they appear not to have some of the convergence\u00a0problems of value-function-based methods.<\/p>\n<hr \/>\n<p><a name=\"backpropagation through time\"><\/a><\/p>\n<h3>What about backpropagation through time?<\/h3>\n<hr \/>\n<p><a name=\"continuous time\"><\/a><\/p>\n<h3>What about continuous time?<\/h3>\n<p>Although the ideas of RL extend naturally, in principle, to continuous time, there\u00a0has been relatively little work here.  The best work on this so far is probably that by\u00a0<a href=\"http:\/\/www.isd.atr.co.jp\/%7Edoya\/papers_new01.html#reinforcement\">Kenji Doya<\/a>.<\/p>\n<hr \/>\n<p><a name=\"curse of dimensionality\"><\/a><\/p>\n<h3>What is the curse of dimensionality?<\/h3>\n<p>The curse of dimensionality refers to the tendency of a state space to\u00a0grow exponentially in its dimension, that is, in the number of state\u00a0variables.  This is of course a problem for methods such as dynamic\u00a0programming and other table-based RL methods whose complexity scales\u00a0linearly with the number of states.  
Many RL methods are able to\u00a0partially escape the curse by sampling and by function approximation. \u00a0Curiously, the key step in the tile-coding approach to function\u00a0approximation is expanding the original state representation into a vastly\u00a0<em>higher dimensional<\/em> space.  This makes many complex nonlinear\u00a0relationships in the original representation simpler and linear in the\u00a0expanded representation.  The more dimensions in the expanded\u00a0representation, the more functions can be learned easily and quickly.  I\u00a0call this the <em>blessing of dimensionality<\/em>.  It is one of the ideas\u00a0behind modern support vector machines, but in fact it goes back at\u00a0least as far as the perceptron.<\/p>\n<hr \/>\n<p><a name=\"tile-coding just grids\"><\/a><\/p>\n<h3>Isn&#8217;t tile-coding just grids, and thus subject to the<br \/>\ncurse of dimensionality?<\/h3>\n<p>Basically, no.  Tile coding is a quite general idea and many of the ways it\u00a0can be used avoid the curse of dimensionality.  There are at least three common tricks: 1) you can consider subsets of the state variables in\u00a0separate tilings, 2) you can use overlapping tilings to get vastly higher\u00a0resolution in high dimensional spaces than would be possible with a simple\u00a0grid using the same amount of memory, and 3) you can hash down to a smaller\u00a0space so as only to devote memory to the portions of state space that\u00a0actually occur.<\/p>\n<hr \/>\n<p><a name=\"CMACs\"><\/a><\/p>\n<h3>Why do you call it &#8220;tile coding&#8221; instead of the conventional name, &#8220;CMACs&#8221;?<\/h3>\n<p>The idea of tile coding is essentially the same as that underlying CMACs.\u00a0Without in any way intending to detract\u00a0from the credit due the pioneers of CMACs (e.g., Albus, Marr, Miller&#8230;),\u00a0sometimes it is better to switch to a new name.  
The name CMAC stands for, among\u00a0other things,\u00a0&#8220;Cerebellar Model Articulation Controller,&#8221; which seems pretty\u00a0inappropriate for the current usage.  The original CMAC also used a\u00a0slightly different algorithm &#8212; a correlation rule rather than an error\u00a0correction rule.  The use in reinforcement learning steps it even farther\u00a0away from its original use.  What remains is not so much a learning system\u00a0(much less a cerebellar model), but a way of coding states for use by a\u00a0learning system.  The key features of the coding are that it is based on\u00a0multiple exhaustive partitions (tilings!) of the state space, and that it\u00a0is particularly suited for implementation on serial, digital computers.<\/p>\n<hr \/>\n<p><a name=\"Are RL methods stable with function approximation\"><\/a><\/p>\n<h3>Are RL methods stable with function approximation?<\/h3>\n<p>The situation is a bit complicated and in flux at present.\u00a0Stability guarantees depend on the specific algorithm\u00a0and function approximator, and on the way it is used.\u00a0This is what we knew as of August 2001:<\/p>\n<ul>\n<li>For some nonlinear parameterized function approximators (FA), any\u00a0temporal-difference (TD) learning method (including Q-learning\u00a0and Sarsa)\u00a0can become unstable (parameters and estimates going to infinity). 
[Tsitsiklis &amp; Van Roy 1996]<\/li>\n<li>TD(lambda) with linear FA converges near the best linear solution when trained on-policy&#8230; [<a href=\"http:\/\/web.mit.edu\/jnt\/www\/Papers\/td.ps\">Tsitsiklis &amp; Van Roy 1997<\/a>] &#8230;but may become unstable when trained off-policy (updating states with a different distribution than that seen when following the policy). [<a href=\"http:\/\/www.leemon.com\/papers\/residual\/res.html#RTFToC1\">Baird 1995<\/a>] From which it follows that Q-learning with linear FA can also be unstable. [<a href=\"http:\/\/www.leemon.com\/papers\/residual\/res.html#RTFToC1\">Baird 1995<\/a>]<\/li>\n<li>Sarsa(lambda), on the other hand, is guaranteed stable, although only the weakest of error bounds has been shown. [<a href=\"http:\/\/nips.djvuzone.org\/djvu\/nips13\/Gordon.djvu\">Gordon 2001<\/a>]<\/li>\n<li>New linear TD algorithms for the off-policy case have been shown to converge near the best solution. [<a href=\"publications.html#offpolicy\">Precup, Sutton &amp; Dasgupta 2001<\/a>]<\/li>\n<\/ul>\n<p>Since then, the new <a href=\"http:\/\/www.mcb.mcgill.ca\/%7Eperkins\/publications\/PerPreNIPS02.pdf\">Perkins and Precup result from NIPS 2002<\/a> has appeared, which may at last have resolved the question positively by proving the convergence of Sarsa(0) with linear function approximation and an appropriate exploration regime.<\/p>\n<hr \/>\n<p><a name=\"Advice and Opinions\"><\/a><\/p>\n<h2>Advice and Opinions<\/h2>\n<hr \/>\n<p><a name=\"backpropagation\"><\/a><\/p>\n<h3>I am doing RL with a backpropagation neural network and it doesn&#8217;t work;<br \/>\nwhat should I do?<\/h3>\n<p>It is a common error to use a backpropagation neural network as the function approximator in one&#8217;s first experiments with reinforcement learning, which almost always leads to an unsatisfying failure.  
The primary reason for the failure is that backpropagation is fairly tricky to use effectively, doubly so in an online application like reinforcement learning.  It is true that Tesauro used this approach in his strikingly successful backgammon application, but note that at the time of his work with TD-Gammon, Tesauro was already an expert in applying backprop networks to backgammon.  He had already built the world&#8217;s best computer player of backgammon using backprop networks.  He had already learned all the tricks and tweaks and parameter settings to make backprop networks learn well. Unless you have a similarly extensive background of experience, you are likely to be very frustrated using a backprop network as your function approximator in reinforcement learning.<\/p>\n<p>The solution is to step back and accept that you can only innovate in one area at a time.  First make sure you understand RL ideas in the tabular case, and the general principles of function approximation in RL.  Then proceed to a better-understood function approximator, such as a linear one.  In my own work I never go beyond linear function approximators.  I just augment them with larger and larger<br \/>\nfeature vectors, using coarse tile coding (see, e.g., Chapter 8 of the book; Sutton, 1996; Stone and Sutton, 2001).  It may be that this approach would always be superior to backpropagation nets, but that remains to be seen.  
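To make the linear recommendation concrete, here is a minimal sketch of the semi-gradient Sarsa(0) update over the sparse binary features that tile coding produces. It is a toy under assumed names, not a full implementation from the book:

```python
# Minimal semi-gradient Sarsa(0) with a linear function approximator
# over sparse binary features (e.g., active tile indices).  All names
# and parameter values are illustrative assumptions.

def q_value(weights, active_features):
    """Linear value estimate: the sum of the weights of the active features."""
    return sum(weights[i] for i in active_features)

def sarsa_update(weights, active, reward, next_active,
                 alpha=0.1, gamma=1.0, terminal=False):
    """One Sarsa(0) step: move the active weights toward the TD target."""
    target = reward if terminal else reward + gamma * q_value(weights, next_active)
    delta = target - q_value(weights, active)
    for i in active:
        # The gradient of a linear form is the feature vector, which is
        # exactly 1 on the active features and 0 elsewhere.
        weights[i] += alpha * delta
    return weights

# Toy usage: two active features for (s, a), reward 1, terminal next state.
w = [0.0] * 16
sarsa_update(w, active=[3, 7], reward=1.0, next_active=[], terminal=True)
```

In practice the step size is commonly divided by the number of active features (the number of tilings), so that one update moves the estimate a fixed fraction of the way to the target.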
But someone new to learning and RL algorithms certainly should <em>not<\/em> start with backprop nets.<\/p>\n<p>Also see <a href=\"http:\/\/www.cs.ualberta.ca\/%7Esutton\/publications.html#TDlambda_details\">here<\/a> for some of the details of using TD methods with backprop nets.<\/p>\n<hr \/>\n<p><a name=\"function approximator\"><\/a><\/p>\n<h3>What kind of function approximator do you recommend?<\/h3>\n<hr \/>\n<p><a name=\"best algorithm\"><\/a><\/p>\n<h3>What is the best algorithm to use?<\/h3>\n<p>The true answer, of course, is that we don&#8217;t know, and that it probably hasn&#8217;t been invented yet.  Each algorithm has strengths and weaknesses, and the current favorite changes every few years.  In the 1980s actor-critic methods were very popular, but in the 1990s they were largely superseded by value-function methods such as Q-learning and Sarsa.  Q-learning is probably still the most widely used, but its instability with function approximation, discovered in 1995, probably rules it out for the long run. Recently policy-based methods such as actor-critic and value-function-less methods, including some of those from the 1980s, have become popular again.  So, it seems we must keep our minds and options open as RL moves forward.<\/p>\n<hr \/>\n<p><a name=\"why linear\"><\/a><\/p>\n<h3>How can you recommend linear function approximation when we all know<br \/>\nthese are limited?  Why not use the powerful nonlinear function<br \/>\napproximators such as neural networks, decision trees, and so forth<br \/>\nthat are so popular in supervised learning?<\/h3>\n<hr \/>\n<p><a name=\"Why not instance-based\"><\/a><\/p>\n<h3>What about instance-based function approximation methods such as<br \/>\nnearest neighbor and local linear regression?  
Why don&#8217;t you<br \/>\nrecommend them?<\/h3>\n<hr \/>\n<p><a name=\"tile-coding choices\"><\/a><\/p>\n<h3>How do you make the representational choices in tile coding?<\/h3>\n<p>In tile coding, how do you pick the state variables to use, what groups of them to treat conjunctively, how many tilings, the tile widths, etc.?  Do we have to make all these kinds of representational decisions ourselves, as system designers, or are there ways of automating these decisions?<\/p>\n<hr \/>\n<p><a name=\"Q-value\"><\/a><\/p>\n<h3>Why do you disparage the term Q-value?<\/h3>\n<p>RL researchers often talk about Q-values, Q-tables, and Q-functions, but to me such usage almost always seems like jargon&#8212;that is, like apparently precise technical terms that are off-putting to newcomers but really add nothing.  In this case the Q- prefix means only that the values, tables, or functions pertain to state-action pairs. There are several pertinent functions that take state-action pairs: for example, one returning the optimal values, one returning the true values for some policy, and others returning approximations of one of these two.  In most cases we give these functions different notations, all of which involve the letter Q.  That is why I consider the name Q-function to be apparently technical but in fact imprecise, and thus undesirable, off-putting jargon.  If you are just talking in general about the value of an action at a state, then the term I prefer is simply &#8220;action value&#8221;.<\/p>\n<hr \/>\n<p><a name=\"Miscellaneous\"><\/a><\/p>\n<h2>Miscellaneous Questions<\/h2>\n<hr \/>\n<p><a name=\"project help\"><\/a><\/p>\n<h3>I am working on a project due soon. Can you help me?<\/h3>\n<p>Reinforcement learning is a big topic, with many large branches and under-explored side areas.  And there are no cookbook (easy, ready, immediate) solutions.  
I can sometimes offer advice that may point you in the right direction, but it will almost always take time, lots of your time, to study and come to understand the issues before you will see the appropriate solution.  If you are running up against a deadline, there is little that I can do.<\/p>\n<hr \/>\n<p><a name=\"code trouble\"><\/a><\/p>\n<h3>I am having trouble with your code.  Can you help me<br \/>\nget it to work?<\/h3>\n<p>This code is made available on an as-is basis, and in many cases I can provide little or no support.  In many cases I have no personal experience with the code, and in others it will not even run outside of my special computer environments.  I make the code available nevertheless, because reading it may clarify some of the implementation points for you.  In a few cases, like the tile-coding implementations in C++, I have written the code myself to run universally, and I can provide some support.<\/p>\n<p>Maybe you could contribute some of your own code, perhaps a class project, that is better documented and works on a wider range of computational platforms.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Rich Sutton has prepared a FAQ about reinforcement learning &#8212; how to think about it and how to do it successfully in practice. 
&#8220;I have been free with my opinions, but I\u00a0would also welcome the opinions of others for inclusion<\/p>\n<p><a href=\"https:\/\/spaces.facsci.ualberta.ca\/rlai\/resources\/rl-faq\/\" class=\"more-link themebutton\">Read More<\/a><\/p>\n","protected":false},"author":5,"featured_media":0,"parent":11,"menu_order":4,"comment_status":"closed","ping_status":"open","template":"","meta":{"footnotes":""},"class_list":["post-146","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/spaces.facsci.ualberta.ca\/rlai\/wp-json\/wp\/v2\/pages\/146","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/spaces.facsci.ualberta.ca\/rlai\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/spaces.facsci.ualberta.ca\/rlai\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/spaces.facsci.ualberta.ca\/rlai\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/spaces.facsci.ualberta.ca\/rlai\/wp-json\/wp\/v2\/comments?post=146"}],"version-history":[{"count":0,"href":"https:\/\/spaces.facsci.ualberta.ca\/rlai\/wp-json\/wp\/v2\/pages\/146\/revisions"}],"up":[{"embeddable":true,"href":"https:\/\/spaces.facsci.ualberta.ca\/rlai\/wp-json\/wp\/v2\/pages\/11"}],"wp:attachment":[{"href":"https:\/\/spaces.facsci.ualberta.ca\/rlai\/wp-json\/wp\/v2\/media?parent=146"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}