
Reinforcement Learning

Slides from
R.S. Sutton and A.G. Barto,
Reinforcement Learning: An Introduction
http://www.cs.ualberta.ca/~sutton/book/thebook.html
http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse.html

The Agent-Environment Interface

Agent and environment interact at discrete time steps: t = 0, 1, 2, ...
Agent observes state at step t: s_t ∈ S
produces action at step t: a_t ∈ A(s_t)
gets resulting reward: r_{t+1}
and resulting next state: s_{t+1}

The interaction produces the trajectory
... s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, ...

The Agent Learns a Policy

Policy at step t, π_t:
a mapping from states to action probabilities
π_t(s, a) = probability that a_t = a when s_t = s
Reinforcement learning methods specify how the agent changes its policy as a result of experience.
Roughly, the agent's goal is to get as much reward as it can over the long run.
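As a concrete illustration (a minimal sketch, not from the slides), a tabular stochastic policy can be stored as a dict of action probabilities per state and sampled from at each step; the state and action names below are invented for the example.

    import random

    # pi(s, a): for each state, a dict of action probabilities (summing to 1).
    policy = {
        "s0": {"left": 0.5, "right": 0.5},   # hypothetical states and actions
        "s1": {"left": 0.1, "right": 0.9},
    }

    def sample_action(policy, state):
        """Draw a_t ~ pi_t(s_t, .) for the current state."""
        actions, probs = zip(*policy[state].items())
        return random.choices(actions, weights=probs, k=1)[0]

    print(sample_action(policy, "s0"))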

Getting the Degree of Abstraction Right

Time steps need not refer to fixed intervals of real time.
Actions can be low level (e.g., voltages to motors), high level (e.g., accept a job offer), mental (e.g., shift in focus of attention), etc.
States can be low-level sensations, or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being surprised or lost).
An RL agent is not like a whole animal or robot, which consists of many RL agents as well as other components.
The environment is not necessarily unknown to the agent, only incompletely controllable.
Reward computation is in the agent's environment because the agent cannot change it arbitrarily.

Goals and Rewards

Is a scalar reward signal an adequate notion of a goal?
Maybe not, but it is surprisingly flexible.
A goal should specify what we want to achieve, not how we want to achieve it.
A goal must be outside the agent's direct control, thus outside the agent.
The agent must be able to measure success:
explicitly;
frequently during its lifespan.

Returns

Suppose the sequence of rewards after step t is:
r_{t+1}, r_{t+2}, r_{t+3}, ...
What do we want to maximize?

In general, we want to maximize the expected return, E{R_t}, for each step t.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.

R_t = r_{t+1} + r_{t+2} + ... + r_T,
where T is a final time step at which a terminal state is reached, ending an episode.

Returns for Continuing Tasks

Continuing tasks: interaction does not have natural episodes.
Discounted return:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^∞ γ^k r_{t+k+1},

where γ, 0 ≤ γ ≤ 1, is the discount rate.
shortsighted 0 ← γ → 1 farsighted
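As a quick numeric illustration (a sketch, not part of the slides), the discounted return of a finite reward sequence is just the sum of γ^k r_{t+k+1}:

    def discounted_return(rewards, gamma):
        """R_t = sum over k of gamma**k * r_{t+k+1} for a finite reward sequence."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    # e.g., rewards (r_{t+1}, r_{t+2}, r_{t+3}) = (1, 1, 1) with gamma = 0.9
    # gives 1 + 0.9 + 0.81 = 2.71
    print(discounted_return([1, 1, 1], 0.9))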

An Example

Avoid failure: the pole falling beyond a critical angle or the cart hitting the end of the track.

As an episodic task where the episode ends upon failure:
reward = +1 for each step before failure
return = number of steps before failure

As a continuing task with discounted return:
reward = −1 upon failure; 0 otherwise
return = −γ^k, for k steps before failure

In either case, return is maximized by avoiding failure for as long as possible.

Another Example

Get to the top of the hill as quickly as possible.

reward = −1 for each step where not at top of hill
return = −(number of steps) before reaching top of hill

Return is maximized by minimizing the number of steps to reach the top of the hill.

A Unified Notation

In episodic tasks, we number the time steps of each episode starting from zero.
We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j.
Think of each episode as ending in an absorbing state that always produces a reward of zero.

We can cover all cases by writing R_t = Σ_{k=0}^∞ γ^k r_{t+k+1},

where γ can be 1 only if a zero-reward absorbing state is always reached.

The Markov Property

By "the state" at step t, the book means whatever information is available to the agent at step t about its environment.
The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations.
Ideally, a state should summarize past sensations so as to retain all essential information, i.e., it should have the Markov Property:

Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, ..., r_1, s_0, a_0 }
  = Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t }

for all s', r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, ..., r_1, s_0, a_0.

Markov Decision Processes

If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
If state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need to give:
state and action sets

one-step dynamics defined by transition probabilities:
P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a } for all s, s' ∈ S, a ∈ A(s).

expected rewards:
R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' } for all s, s' ∈ S, a ∈ A(s).

An Example Finite MDP

Recycling Robot
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
Searching is better but runs down the battery; if it runs out of power while searching, it has to be rescued (which is bad).
Decisions are made on the basis of the current energy level: high, low.
Reward = number of cans collected
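As an illustrative sketch only, the recycling robot can be written down as a finite MDP by tabulating P^a_{ss'} and R^a_{ss'}; the numeric probabilities and reward magnitudes below are placeholders, not values from the slides.

    # Finite MDP as nested dicts: mdp[s][a] -> list of (prob, next_state, expected_reward).
    R_SEARCH, R_WAIT = 2.0, 1.0        # hypothetical expected cans per step
    P_HIGH, P_LOW = 0.8, 0.6           # hypothetical battery-survival probabilities

    recycling_mdp = {
        "high": {
            "search": [(P_HIGH, "high", R_SEARCH), (1 - P_HIGH, "low", R_SEARCH)],
            "wait":   [(1.0, "high", R_WAIT)],
        },
        "low": {
            "search": [(P_LOW, "low", R_SEARCH), (1 - P_LOW, "high", -3.0)],  # ran out: rescued, penalized
            "wait":   [(1.0, "low", R_WAIT)],
            "recharge": [(1.0, "high", 0.0)],
        },
    }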

Value Functions

The value of a state is the expected return starting from that state; it depends on the agent's policy:

State-value function for policy π:
V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }

The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π:

Action-value function for policy π:
Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a }

Bellman Equation for a Policy π

The basic idea:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ...
    = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + ... )
    = r_{t+1} + γ R_{t+1}

So:

V^π(s) = E_π{ R_t | s_t = s }
       = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }

Or, without the expectation operator:

V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

More on the Bellman Equation

V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

This is a set of equations (in fact, linear), one for each state.
The value function for π is its unique solution.
Backup diagrams (figures): one for V^π, one for Q^π.
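Since these are |S| linear equations, one way to see them concretely is the iterative policy-evaluation sweep from dynamic programming. Below is a minimal sketch, assuming the nested-dict MDP layout used in the recycling-robot example above and a tabular policy of action probabilities.

    def policy_evaluation(mdp, policy, gamma=0.9, theta=1e-8):
        """Iteratively apply the Bellman equation for V^pi until the values stop changing.

        mdp[s][a]    -> list of (prob, next_state, expected_reward)
        policy[s][a] -> pi(s, a)
        """
        V = {s: 0.0 for s in mdp}
        while True:
            delta = 0.0
            for s in mdp:
                v = sum(
                    policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
                    for a in mdp[s]
                )
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                return V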

Gridworld

Actions: north, south, east, west; deterministic.
If an action would take the agent off the grid: no move, but reward = −1.
Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown.

(Figure: state-value function for the equiprobable random policy; γ = 0.9.)

Optimal Value Functions

For finite MDPs, policies can be partially ordered:
π ≥ π' if and only if V^π(s) ≥ V^π'(s) for all s ∈ S

There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all π*.

Optimal policies share the same optimal state-value function:
V*(s) = max_π V^π(s) for all s ∈ S

Optimal policies also share the same optimal action-value function:
Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A(s)

This is the expected return for taking action a in state s and thereafter following an optimal policy.

Bellman Optimality Equation for V*

The value of a state under an optimal policy must equal the expected return for the best action from that state:

V*(s) = max_{a ∈ A(s)} Q^{π*}(s, a)
      = max_{a ∈ A(s)} E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }
      = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]

The relevant backup diagram: (figure)

V* is the unique solution of this system of nonlinear equations.
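Turning this fixed-point equation into an iteration gives value iteration (Chapter 4 material). A minimal sketch, again assuming the nested-dict MDP format from the earlier examples:

    def value_iteration(mdp, gamma=0.9, theta=1e-8):
        """Repeatedly apply V(s) <- max_a sum_{s'} P [R + gamma V(s')] until convergence."""
        V = {s: 0.0 for s in mdp}
        while True:
            delta = 0.0
            for s in mdp:
                v = max(
                    sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
                    for a in mdp[s]
                )
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # A policy that is greedy with respect to V* is optimal (see the later slides).
        policy = {
            s: max(mdp[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a]))
            for s in mdp
        }
        return V, policy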

Bellman Optimality Equation for Q*

Q*(s, a) = E{ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a }
         = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ max_{a'} Q*(s', a') ]

The relevant backup diagram: (figure)

Q* is the unique solution of this system of nonlinear equations.

Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to V* is an optimal policy.
Therefore, given V*, one-step-ahead search produces the long-term optimal actions.
E.g., back to the gridworld: (figure)

What About Optimal Action-Value Functions?

Given Q*, the agent does not even have to do a one-step-ahead search:

π*(s) = arg max_{a ∈ A(s)} Q*(s, a)
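A sketch of this greedy selection for a tabular Q* stored as a dict of per-state action values (the storage format is an assumption for illustration):

    def greedy_action(Q, s):
        """pi*(s) = argmax_a Q*(s, a) -- no one-step lookahead through the model needed."""
        return max(Q[s], key=Q[s].get)

    # e.g., Q = {"low": {"search": 1.2, "wait": 0.9, "recharge": 1.5}}
    # greedy_action(Q, "low") -> "recharge"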

Solving the Bellman Optimality Equation

Finding an optimal policy by solving the Bellman Optimality Equation requires the following:

accurate knowledge of environment dynamics;

enough space and time to do the computation;

the Markov Property.
How much space and time do we need?

polynomial in the number of states (via dynamic programming methods; Chapter 4),

BUT, the number of states is often huge (e.g., backgammon has about 10^20 states).
We usually have to settle for approximations.
Many RL methods can be understood as approximately solving the Bellman Optimality Equation.

TD Prediction

Policy Evaluation (the prediction problem):
for a given policy π, compute the state-value function V^π

The simplest TD method, TD(0):
V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

target: r_{t+1} + γ V(s_{t+1}), an estimate of the return
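The same update written as a small function (a sketch; the value table and the surrounding environment loop are assumptions, not given on the slide):

    from collections import defaultdict

    V = defaultdict(float)  # tabular state-value estimates, default 0

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
        """V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)].

        The bracketed TD target uses the current estimate V(s_{t+1}) in place of the full return.
        """
        target = r + (0.0 if terminal else gamma * V[s_next])
        V[s] += alpha * (target - V[s])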


Simplest TD Method

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

(Figure: TD(0) backup diagram for the transition s_t → r_{t+1}, s_{t+1}, with terminal states marked T.)


Example: Driving Home

State                 Elapsed Time (minutes)   Predicted Time to Go   Predicted Total Time
leaving office         0                       30                     30
reach car, raining     5                       35                     40
exit highway          20                       15                     35
behind truck          30                       10                     40
home street           40                        3                     43
arrive home           43                        0                     43


Driving Home

(Figure, two panels: changes recommended by Monte Carlo methods (α = 1), and changes recommended by TD methods (α = 1).)

Advantages of TD Learning

TD methods do not require a model of the environment, only experience.
TD methods can be fully incremental.
You can learn before knowing the final outcome:
less memory
less peak computation
You can learn without the final outcome:
from incomplete sequences


Random Walk Example

(Figure: values learned by TD(0) after various numbers of episodes.)
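For readers who want to reproduce this kind of result, here is a minimal sketch of the book's five-state random walk (nonterminal states A–E, start at C, reward +1 only on exiting to the right, 0 otherwise) with tabular TD(0); the code structure itself is an assumption.

    import random

    STATES = ["A", "B", "C", "D", "E"]          # nonterminal states; terminals lie off both ends

    def run_td0_random_walk(num_episodes=100, alpha=0.1, gamma=1.0):
        """Tabular TD(0) on the 5-state random walk (reward +1 on exiting right, else 0)."""
        V = {s: 0.5 for s in STATES}             # intermediate initial values
        for _ in range(num_episodes):
            i = STATES.index("C")                # every episode starts in the center state
            while True:
                j = i + random.choice([-1, 1])   # step left or right with equal probability
                if j < 0:                        # exited left: reward 0, terminal value 0
                    r, v_next, done = 0.0, 0.0, True
                elif j >= len(STATES):           # exited right: reward +1, terminal value 0
                    r, v_next, done = 1.0, 0.0, True
                else:
                    r, v_next, done = 0.0, V[STATES[j]], False
                V[STATES[i]] += alpha * (r + gamma * v_next - V[STATES[i]])
                if done:
                    break
                i = j
        return V

    print(run_td0_random_walk())                 # true values are 1/6, 2/6, ..., 5/6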


TD and MC on the Random Walk

(Figure: learning curves; data averaged over 100 sequences of episodes.)

Optimality of TD(0)

Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
Compute updates according to TD(0), but only update estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.
Constant-α MC also converges under these conditions, but to a different answer!
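A sketch of batch updating as described above; the episode storage format and the convergence test are assumptions made for the example.

    def batch_td0(episodes, alpha=0.01, gamma=1.0, theta=1e-6, max_sweeps=100000):
        """Batch TD(0): sweep the stored episodes repeatedly, accumulating increments,
        and apply them only after each complete pass, until the increments vanish.

        episodes: list of lists of (s, r, s_next, terminal) transitions.
        """
        V = {}
        for ep in episodes:
            for s, _, s_next, _ in ep:
                V.setdefault(s, 0.0)
                V.setdefault(s_next, 0.0)
        for _ in range(max_sweeps):
            delta = {s: 0.0 for s in V}
            for ep in episodes:
                for s, r, s_next, terminal in ep:
                    target = r + (0.0 if terminal else gamma * V[s_next])
                    delta[s] += alpha * (target - V[s])
            if max(abs(d) for d in delta.values()) < theta:
                break
            for s in V:
                V[s] += delta[s]
        return V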


Random Walk under Batch Updating

(Figure. After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All repeated 100 times.)

Learning an Action-Value Function

Estimate Q^π for the current behavior policy π.

After every transition from a nonterminal state s_t, do this:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
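The same update as a small function (a sketch; the tabular Q and the calling environment loop are assumed, not shown on the slide):

    from collections import defaultdict

    Q = defaultdict(float)   # Q[(state, action)], default 0

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0, terminal=False):
        """Q(s_t,a_t) <- Q(s_t,a_t) + alpha*[r_{t+1} + gamma*Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)],
        with Q(s_{t+1}, a_{t+1}) taken to be 0 when s_{t+1} is terminal."""
        target = r + (0.0 if terminal else gamma * Q[(s_next, a_next)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])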


Sarsa: On-Policy TD Control

Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:
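A minimal episode loop in that spirit, acting ε-greedily with respect to the current Q; the env.reset()/env.step() interface is an assumed convention for the sketch, not part of the slides.

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        """Pick a random action with probability epsilon, else the greedy one."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def sarsa(env, actions, num_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1):
        """On-policy TD control: act epsilon-greedily w.r.t. Q and update toward the
        reward plus the value of the action actually taken next."""
        Q = defaultdict(float)
        for _ in range(num_episodes):
            s = env.reset()                       # assumed environment interface
            a = epsilon_greedy(Q, s, actions, epsilon)
            done = False
            while not done:
                s_next, r, done = env.step(a)     # assumed to return (state, reward, done)
                a_next = epsilon_greedy(Q, s_next, actions, epsilon)
                target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s, a = s_next, a_next
        return Q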


Windy Gridworld

Undiscounted, episodic, reward = −1 until goal.


Results of Sarsa on the Windy Gridworld


Cliff Walking

ε-greedy, ε = 0.1

