
Reinforcement Learning

Slides from
R.S. Sutton and A.G. Barto,
Reinforcement Learning: An Introduction
http://www.cs.ualberta.ca/~sutton/book/thebook.html
http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse.html

The Agent-Environment Interface

Agent and environment interact at discrete time steps: t = 0, 1, 2, ...
Agent observes state at step t: s_t ∈ S
produces action at step t: a_t ∈ A(s_t)
gets resulting reward: r_{t+1}
and resulting next state: s_{t+1}

The interaction produces the trajectory
... s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, ...

The Agent Learns a Policy

Policy at step t, π_t:
a mapping from states to action probabilities
π_t(s, a) = probability that a_t = a when s_t = s
Reinforcement learning methods specify how the agent changes its policy as a result of experience.
Roughly, the agent's goal is to get as much reward as it can over the long run.
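As a concrete illustration (a minimal sketch, not from the slides), a tabular stochastic policy can be stored as a dict of action probabilities per state and sampled from at each step; the state and action names below are invented for the example.

    import random

    # pi(s, a): for each state, a dict of action probabilities (summing to 1).
    policy = {
        "s0": {"left": 0.5, "right": 0.5},   # hypothetical states and actions
        "s1": {"left": 0.1, "right": 0.9},
    }

    def sample_action(policy, state):
        """Draw a_t ~ pi_t(s_t, .) for the current state."""
        actions, probs = zip(*policy[state].items())
        return random.choices(actions, weights=probs, k=1)[0]

    print(sample_action(policy, "s0"))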

Getting the Degree of Abstraction Right

Time steps need not refer to fixed intervals of real time.
Actions can be low level (e.g., voltages to motors), high level (e.g., accept a job offer), mental (e.g., shift in focus of attention), etc.
States can be low-level sensations, or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being surprised or lost).
An RL agent is not like a whole animal or robot, which consists of many RL agents as well as other components.
The environment is not necessarily unknown to the agent, only incompletely controllable.
Reward computation is in the agent's environment because the agent cannot change it arbitrarily.

Goals and Rewards

Is a scalar reward signal an adequate notion of a goal?
Maybe not, but it is surprisingly flexible.
A goal should specify what we want to achieve, not how we want to achieve it.
A goal must be outside the agent's direct control, thus outside the agent.
The agent must be able to measure success:
explicitly;
frequently during its lifespan.

Returns

Suppose the sequence of rewards after step t is:
r_{t+1}, r_{t+2}, r_{t+3}, ...
What do we want to maximize?

In general, we want to maximize the expected return, E{R_t}, for each step t.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.

R_t = r_{t+1} + r_{t+2} + ... + r_T,
where T is a final time step at which a terminal state is reached, ending an episode.

Returns for Continuing Tasks

Continuing tasks: interaction does not have natural episodes.
Discounted return:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^∞ γ^k r_{t+k+1},

where γ, 0 ≤ γ ≤ 1, is the discount rate.
shortsighted 0 ← γ → 1 farsighted
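As a quick numeric illustration (a sketch, not part of the slides), the discounted return of a finite reward sequence is just the sum of γ^k r_{t+k+1}:

    def discounted_return(rewards, gamma):
        """R_t = sum over k of gamma**k * r_{t+k+1} for a finite reward sequence."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    # e.g., rewards (r_{t+1}, r_{t+2}, r_{t+3}) = (1, 1, 1) with gamma = 0.9
    # gives 1 + 0.9 + 0.81 = 2.71
    print(discounted_return([1, 1, 1], 0.9))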

An Example

Avoid failure: the pole falling beyond a critical angle or the cart hitting the end of the track.

As an episodic task where the episode ends upon failure:
reward = +1 for each step before failure
return = number of steps before failure

As a continuing task with discounted return:
reward = −1 upon failure; 0 otherwise
return = −γ^k, for k steps before failure

In either case, return is maximized by avoiding failure for as long as possible.

Another Example

Get to the top of the hill as quickly as possible.

reward = −1 for each step where not at top of hill
return = −(number of steps) before reaching top of hill

Return is maximized by minimizing the number of steps to reach the top of the hill.

A Unified Notation

In episodic tasks, we number the time steps of each episode starting from zero.
We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j.
Think of each episode as ending in an absorbing state that always produces a reward of zero.

We can cover all cases by writing R_t = Σ_{k=0}^∞ γ^k r_{t+k+1},

where γ can be 1 only if a zero-reward absorbing state is always reached.

The Markov Property

By "the state" at step t, the book means whatever information is available to the agent at step t about its environment.
The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations.
Ideally, a state should summarize past sensations so as to retain all essential information, i.e., it should have the Markov Property:

Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, ..., r_1, s_0, a_0 }
  = Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t }

for all s', r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, ..., r_1, s_0, a_0.

Markov Decision Processes

If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
If state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need to give:
state and action sets

one-step dynamics defined by transition probabilities:
P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a } for all s, s' ∈ S, a ∈ A(s).

expected rewards:
R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' } for all s, s' ∈ S, a ∈ A(s).

An Example Finite MDP

Recycling Robot
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
Searching is better but runs down the battery; if it runs out of power while searching, it has to be rescued (which is bad).
Decisions are made on the basis of the current energy level: high, low.
Reward = number of cans collected
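As an illustrative sketch only, the recycling robot can be written down as a finite MDP by tabulating P^a_{ss'} and R^a_{ss'}; the numeric probabilities and reward magnitudes below are placeholders, not values from the slides.

    # Finite MDP as nested dicts: mdp[s][a] -> list of (prob, next_state, expected_reward).
    R_SEARCH, R_WAIT = 2.0, 1.0        # hypothetical expected cans per step
    P_HIGH, P_LOW = 0.8, 0.6           # hypothetical battery-survival probabilities

    recycling_mdp = {
        "high": {
            "search": [(P_HIGH, "high", R_SEARCH), (1 - P_HIGH, "low", R_SEARCH)],
            "wait":   [(1.0, "high", R_WAIT)],
        },
        "low": {
            "search": [(P_LOW, "low", R_SEARCH), (1 - P_LOW, "high", -3.0)],  # ran out: rescued, penalized
            "wait":   [(1.0, "low", R_WAIT)],
            "recharge": [(1.0, "high", 0.0)],
        },
    }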

Value Functions

The value of a state is the expected return starting from that state; it depends on the agent's policy:

State-value function for policy π:
V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }

The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π:

Action-value function for policy π:
Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a }

Bellman Equation for a Policy π

The basic idea:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ...
    = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + ... )
    = r_{t+1} + γ R_{t+1}

So:

V^π(s) = E_π{ R_t | s_t = s }
       = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }

Or, without the expectation operator:

V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

More on the Bellman Equation

V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

This is a set of equations (in fact, linear), one for each state.
The value function for π is its unique solution.
Backup diagrams (figures): one for V^π, one for Q^π.
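Since these are |S| linear equations, one way to see them concretely is the iterative policy-evaluation sweep from dynamic programming. Below is a minimal sketch, assuming the nested-dict MDP layout used in the recycling-robot example above and a tabular policy of action probabilities.

    def policy_evaluation(mdp, policy, gamma=0.9, theta=1e-8):
        """Iteratively apply the Bellman equation for V^pi until the values stop changing.

        mdp[s][a]    -> list of (prob, next_state, expected_reward)
        policy[s][a] -> pi(s, a)
        """
        V = {s: 0.0 for s in mdp}
        while True:
            delta = 0.0
            for s in mdp:
                v = sum(
                    policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
                    for a in mdp[s]
                )
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                return V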

Gridworld

Actions: north, south, east, west; deterministic.
If an action would take the agent off the grid: no move, but reward = −1.
Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown.

(Figure: state-value function for the equiprobable random policy; γ = 0.9.)

Optimal Value Functions

For finite MDPs, policies can be partially ordered:
π ≥ π' if and only if V^π(s) ≥ V^π'(s) for all s ∈ S

There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all π*.

Optimal policies share the same optimal state-value function:
V*(s) = max_π V^π(s) for all s ∈ S

Optimal policies also share the same optimal action-value function:
Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A(s)

This is the expected return for taking action a in state s and thereafter following an optimal policy.

Bellman Optimality Equation for V*

The value of a state under an optimal policy must equal the expected return for the best action from that state:

V*(s) = max_{a ∈ A(s)} Q^{π*}(s, a)
      = max_{a ∈ A(s)} E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }
      = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]

The relevant backup diagram: (figure)

V* is the unique solution of this system of nonlinear equations.
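Turning this fixed-point equation into an iteration gives value iteration (Chapter 4 material). A minimal sketch, again assuming the nested-dict MDP format from the earlier examples:

    def value_iteration(mdp, gamma=0.9, theta=1e-8):
        """Repeatedly apply V(s) <- max_a sum_{s'} P [R + gamma V(s')] until convergence."""
        V = {s: 0.0 for s in mdp}
        while True:
            delta = 0.0
            for s in mdp:
                v = max(
                    sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
                    for a in mdp[s]
                )
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # A policy that is greedy with respect to V* is optimal (see the later slides).
        policy = {
            s: max(mdp[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a]))
            for s in mdp
        }
        return V, policy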

Bellman Optimality Equation for Q*

Q*(s, a) = E{ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a }
         = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ max_{a'} Q*(s', a') ]

The relevant backup diagram: (figure)

Q* is the unique solution of this system of nonlinear equations.

Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to V* is an optimal policy.
Therefore, given V*, one-step-ahead search produces the long-term optimal actions.
E.g., back to the gridworld: (figure)

What About Optimal Action-Value Functions?

Given Q*, the agent does not even have to do a one-step-ahead search:

π*(s) = arg max_{a ∈ A(s)} Q*(s, a)
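A sketch of this greedy selection for a tabular Q* stored as a dict of per-state action values (the storage format is an assumption for illustration):

    def greedy_action(Q, s):
        """pi*(s) = argmax_a Q*(s, a) -- no one-step lookahead through the model needed."""
        return max(Q[s], key=Q[s].get)

    # e.g., Q = {"low": {"search": 1.2, "wait": 0.9, "recharge": 1.5}}
    # greedy_action(Q, "low") -> "recharge"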

Solving the Bellman Optimality Equation

Finding an optimal policy by solving the Bellman Optimality Equation requires the following:

accurate knowledge of environment dynamics;

enough space and time to do the computation;

the Markov Property.
How much space and time do we need?

polynomial in the number of states (via dynamic programming methods; Chapter 4),

BUT, the number of states is often huge (e.g., backgammon has about 10^20 states).
We usually have to settle for approximations.
Many RL methods can be understood as approximately solving the Bellman Optimality Equation.

TD Prediction

Policy Evaluation (the prediction problem):
for a given policy π, compute the state-value function V^π

The simplest TD method, TD(0):
V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

target: r_{t+1} + γ V(s_{t+1}), an estimate of the return
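The same update written as a small function (a sketch; the value table and the surrounding environment loop are assumptions, not given on the slide):

    from collections import defaultdict

    V = defaultdict(float)  # tabular state-value estimates, default 0

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
        """V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)].

        The bracketed TD target uses the current estimate V(s_{t+1}) in place of the full return.
        """
        target = r + (0.0 if terminal else gamma * V[s_next])
        V[s] += alpha * (target - V[s])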


Simplest TD Method

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

(Figure: TD(0) backup diagram for the transition s_t → r_{t+1}, s_{t+1}, with terminal states marked T.)


Example: Driving Home

State                 Elapsed Time (minutes)   Predicted Time to Go   Predicted Total Time
leaving office         0                       30                     30
reach car, raining     5                       35                     40
exit highway          20                       15                     35
behind truck          30                       10                     40
home street           40                        3                     43
arrive home           43                        0                     43


Driving Home

(Figure, two panels: changes recommended by Monte Carlo methods (α = 1), and changes recommended by TD methods (α = 1).)

Advantages of TD Learning

TD methods do not require a model of the environment, only experience.
TD methods can be fully incremental.
You can learn before knowing the final outcome:
less memory
less peak computation
You can learn without the final outcome:
from incomplete sequences


Random Walk Example

(Figure: values learned by TD(0) after various numbers of episodes.)
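For readers who want to reproduce this kind of result, here is a minimal sketch of the book's five-state random walk (nonterminal states A–E, start at C, reward +1 only on exiting to the right, 0 otherwise) with tabular TD(0); the code structure itself is an assumption.

    import random

    STATES = ["A", "B", "C", "D", "E"]          # nonterminal states; terminals lie off both ends

    def run_td0_random_walk(num_episodes=100, alpha=0.1, gamma=1.0):
        """Tabular TD(0) on the 5-state random walk (reward +1 on exiting right, else 0)."""
        V = {s: 0.5 for s in STATES}             # intermediate initial values
        for _ in range(num_episodes):
            i = STATES.index("C")                # every episode starts in the center state
            while True:
                j = i + random.choice([-1, 1])   # step left or right with equal probability
                if j < 0:                        # exited left: reward 0, terminal value 0
                    r, v_next, done = 0.0, 0.0, True
                elif j >= len(STATES):           # exited right: reward +1, terminal value 0
                    r, v_next, done = 1.0, 0.0, True
                else:
                    r, v_next, done = 0.0, V[STATES[j]], False
                V[STATES[i]] += alpha * (r + gamma * v_next - V[STATES[i]])
                if done:
                    break
                i = j
        return V

    print(run_td0_random_walk())                 # true values are 1/6, 2/6, ..., 5/6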


TD and MC on the Random Walk

(Figure: learning curves; data averaged over 100 sequences of episodes.)

Optimality of TD(0)

Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
Compute updates according to TD(0), but only update estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.
Constant-α MC also converges under these conditions, but to a different answer!
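A sketch of batch updating as described above; the episode storage format and the convergence test are assumptions made for the example.

    def batch_td0(episodes, alpha=0.01, gamma=1.0, theta=1e-6, max_sweeps=100000):
        """Batch TD(0): sweep the stored episodes repeatedly, accumulating increments,
        and apply them only after each complete pass, until the increments vanish.

        episodes: list of lists of (s, r, s_next, terminal) transitions.
        """
        V = {}
        for ep in episodes:
            for s, _, s_next, _ in ep:
                V.setdefault(s, 0.0)
                V.setdefault(s_next, 0.0)
        for _ in range(max_sweeps):
            delta = {s: 0.0 for s in V}
            for ep in episodes:
                for s, r, s_next, terminal in ep:
                    target = r + (0.0 if terminal else gamma * V[s_next])
                    delta[s] += alpha * (target - V[s])
            if max(abs(d) for d in delta.values()) < theta:
                break
            for s in V:
                V[s] += delta[s]
        return V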


Random Walk under Batch Updating

(Figure. After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All repeated 100 times.)

Learning an Action-Value Function

Estimate Q^π for the current behavior policy π.

After every transition from a nonterminal state s_t, do this:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
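The same update as a small function (a sketch; the tabular Q and the calling environment loop are assumed, not shown on the slide):

    from collections import defaultdict

    Q = defaultdict(float)   # Q[(state, action)], default 0

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0, terminal=False):
        """Q(s_t,a_t) <- Q(s_t,a_t) + alpha*[r_{t+1} + gamma*Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)],
        with Q(s_{t+1}, a_{t+1}) taken to be 0 when s_{t+1} is terminal."""
        target = r + (0.0 if terminal else gamma * Q[(s_next, a_next)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])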


Sarsa: On-Policy TD Control

Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:
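A minimal episode loop in that spirit, acting ε-greedily with respect to the current Q; the env.reset()/env.step() interface is an assumed convention for the sketch, not part of the slides.

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        """Pick a random action with probability epsilon, else the greedy one."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def sarsa(env, actions, num_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1):
        """On-policy TD control: act epsilon-greedily w.r.t. Q and update toward the
        reward plus the value of the action actually taken next."""
        Q = defaultdict(float)
        for _ in range(num_episodes):
            s = env.reset()                       # assumed environment interface
            a = epsilon_greedy(Q, s, actions, epsilon)
            done = False
            while not done:
                s_next, r, done = env.step(a)     # assumed to return (state, reward, done)
                a_next = epsilon_greedy(Q, s_next, actions, epsilon)
                target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s, a = s_next, a_next
        return Q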


Windy Gridworld

Undiscounted, episodic, reward = −1 until goal.


Results of Sarsa on the Windy Gridworld


Cliff Walking

ε-greedy, ε = 0.1

