Reinforcement Learning Cheat Sheet
Agent-Environment Interface
The Agent, at each step t, receives a representation of the environment's state, S_t ∈ S, and selects an action A_t ∈ A(s). Then, as a consequence of its action, the agent receives a reward, R_{t+1} ∈ R ⊂ ℝ.
Policy
A policy is a mapping from a state to an action:
    π_t(a|s)    (1)
That is, the probability of selecting the action A_t = a when S_t = s.
Reward
The total reward is expressed as:
    G_t = Σ_{k=0}^{H} γ^k r_{t+k+1}    (2)
Where γ is the discount factor and H is the horizon, which can be infinite.
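As a quick illustration of equation (2), here is a minimal Python sketch that computes the discounted return of a finite list of rewards; the function name and the example rewards are invented for this illustration and are not part of the cheat sheet.

def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k+1} for a finite horizon H = len(rewards)."""
    g = 0.0
    # iterate backwards so each reward is discounted once per remaining step
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62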
Markov Decision Process
A Markov Decision Process, MDP, is a 5-tuple (S, A, P, R, γ) where:
    finite set of states: s ∈ S
    finite set of actions: a ∈ A
    state transition probabilities: p(s' | s, a) = Pr{S_{t+1} = s' | S_t = s, A_t = a}    (3)
    expected reward for state-action-next-state: r(s', s, a) = E[R_{t+1} | S_{t+1} = s', S_t = s, A_t = a]
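To make the dynamics p(s', r | s, a) concrete, here is one possible tabular encoding of a tiny MDP in Python; the state and action names, transition probabilities and rewards are all invented for illustration. Later sketches in this sheet reuse these P, STATES, ACTIONS and GAMMA names.

# P[s][a] is a list of (probability, next_state, reward) triples,
# i.e. an explicit table of p(s', r | s, a). Values are illustrative only.
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

P = {
    "s0": {"left":  [(1.0, "s0", 0.0)],
           "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"left":  [(1.0, "s0", 0.0)],
           "right": [(1.0, "s1", 2.0)]},
}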
Value Function
The value function describes how good it is to be in a specific state s under a certain policy π. For an MDP:
    v_π(s) = E_π[G_t | S_t = s]    (4)
Informally, it is the expected return (the expected cumulative discounted reward) when starting from s and following π.

Optimal
    v*(s) = max_π v_π(s)    (5)

Action-Value (Q) Function
We can also denote the expected return for state, action pairs:
    q_π(s, a) = E_π[G_t | S_t = s, A_t = a]    (6)

Optimal
Similarly, we can do the same for the Q function. The optimal action-value function:
    q*(s, a) = max_π q_π(s, a)    (7)
Clearly, using this new notation we can redefine v*, equation (5), using q*(s, a), equation (7):
    v*(s) = max_{a ∈ A(s)} q_{π*}(s, a)    (8)
Intuitively, the above equation expresses the fact that the value of a state under the optimal policy must be equal to the expected return from the best action from that state.
Bellman Equation
An important recursive property emerges for both the Value (4) and Q (6) functions if we expand them.

Value function:
    v_π(s) = E_π[G_t | S_t = s]
           = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]
           = E_π[ R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_t = s ]
           = Σ_a π(a|s) Σ_{s'} Σ_r p(s', r | s, a) [ r + γ E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_{t+1} = s' ] ]
           = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]    (9)
Here the sums over s' and r run over all possible next states and rewards, and the inner expectation is the expected return from s_{t+1}.

Action-value function:
    q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
              = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]
              = E_π[ R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_t = s, A_t = a ]
              = Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]    (10)
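As a sanity check on equations (9) and (10), here is a short sketch of a single Bellman backup for one state, reusing the illustrative P, STATES, ACTIONS and GAMMA tables from the MDP sketch above; the uniform policy and the zero-initialised V are arbitrary choices made only for this example.

# Assumes the P, STATES, ACTIONS, GAMMA tables from the MDP sketch above.
V = {s: 0.0 for s in STATES}                                             # current value estimates
policy = {s: {a: 1.0 / len(ACTIONS) for a in ACTIONS} for s in STATES}   # uniform π(a|s)

def q_backup(s, a, V):
    """Equation (10): sum over (s', r) of p(s', r | s, a) * (r + γ V(s'))."""
    return sum(prob * (r + GAMMA * V[s2]) for prob, s2, r in P[s][a])

def v_backup(s, V):
    """Equation (9): sum over a of π(a|s) * q_π(s, a)."""
    return sum(policy[s][a] * q_backup(s, a, V) for a in ACTIONS)

print(v_backup("s0", V))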
Dynamic Programming
Taking advantage of the subproblem structure of the V and Q functions, we can find the optimal policy by just planning.

Policy Iteration
We can now find the optimal policy.

    1. Initialisation
       V(s) ∈ ℝ (e.g. V(s) = 0) and π(s) ∈ A arbitrarily for all s ∈ S
    2. Policy Evaluation
       repeat
           ∆ ← 0
           foreach s ∈ S do
               v ← V(s)
               V(s) ← Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [r + γ V(s')]
               ∆ ← max(∆, |v − V(s)|)
           end
       until ∆ < θ (a small positive number)
    3. Policy Improvement
       policy-stable ← true
       foreach s ∈ S do
           old-action ← π(s)
           π(s) ← argmax_a Σ_{s', r} p(s', r | s, a) [r + γ V(s')]
           if old-action ≠ π(s) then policy-stable ← false
       end
       if policy-stable then return V ≈ v* and π ≈ π*, else go to 2
    Algorithm 1: Policy Iteration
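A compact Python sketch of Algorithm 1 on the illustrative tabular MDP defined earlier; the tolerance θ and the deterministic-policy representation are assumptions made for the example, not prescribed by the cheat sheet.

# Assumes the P, STATES, ACTIONS, GAMMA tables from the MDP sketch above.
def q_value(s, a, V):
    return sum(prob * (r + GAMMA * V[s2]) for prob, s2, r in P[s][a])

def policy_iteration(theta=1e-8):
    V = {s: 0.0 for s in STATES}
    pi = {s: ACTIONS[0] for s in STATES}          # deterministic policy π(s)
    while True:
        # 2. Policy Evaluation for the current deterministic π
        while True:
            delta = 0.0
            for s in STATES:
                v = V[s]
                V[s] = q_value(s, pi[s], V)
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # 3. Policy Improvement
        policy_stable = True
        for s in STATES:
            old_action = pi[s]
            pi[s] = max(ACTIONS, key=lambda a: q_value(s, a, V))
            policy_stable = policy_stable and (old_action == pi[s])
        if policy_stable:
            return V, pi

V_star, pi_star = policy_iteration()
print(pi_star)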
Value Iteration
We can avoid waiting until V(s) has converged and instead do the policy improvement and a truncated policy evaluation step in one operation:

    Initialise V(s) ∈ ℝ, e.g. V(s) = 0
    repeat
        ∆ ← 0
        foreach s ∈ S do
            v ← V(s)
            V(s) ← max_a Σ_{s', r} p(s', r | s, a) [r + γ V(s')]
            ∆ ← max(∆, |v − V(s)|)
        end
    until ∆ < θ (a small positive number)
    Algorithm 2: Value Iteration
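A matching sketch of Algorithm 2 on the same illustrative MDP, again with an assumed tolerance θ; the greedy policy is extracted at the end from the converged values.

# Assumes the P, STATES, ACTIONS, GAMMA tables from the MDP sketch above.
def value_iteration(theta=1e-8):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = V[s]
            V[s] = max(sum(prob * (r + GAMMA * V[s2]) for prob, s2, r in P[s][a])
                       for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # greedy policy extracted from the converged values
    pi = {s: max(ACTIONS,
                 key=lambda a: sum(prob * (r + GAMMA * V[s2]) for prob, s2, r in P[s][a]))
          for s in STATES}
    return V, pi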
Monte Carlo Methods
Monte Carlo (MC) is a Model Free method: it does not require complete knowledge of the environment. It is based on averaging sample returns for each state-action pair. The following algorithm gives the basic implementation:

    Initialise, for all s ∈ S, a ∈ A(s):
        Q(s, a) ← arbitrary
        π(s) ← arbitrary
        Returns(s, a) ← empty list
    while forever do
        Choose S_0 ∈ S and A_0 ∈ A(S_0) such that all pairs have probability > 0
        Generate an episode starting at S_0, A_0 and following π
        foreach pair s, a appearing in the episode do
            G ← return following the first occurrence of s, a
            Append G to Returns(s, a)
            Q(s, a) ← average(Returns(s, a))
        end
        foreach s in the episode do
            π(s) ← argmax_a Q(s, a)
        end
    end
    Algorithm 3: Monte Carlo first-visit

For non-stationary problems, the Monte Carlo estimate for, e.g., V is:
    V(S_t) ← V(S_t) + α [G_t − V(S_t)]    (11)
Where α is the learning rate, i.e. how much we want to forget about past experiences.
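A rough Python sketch of Algorithm 3 with exploring starts on the illustrative MDP above. Because that toy MDP has no terminal state, episodes are truncated at a fixed horizon; the horizon, episode count and helper names are assumptions introduced only for this example.

import random
from collections import defaultdict

# Assumes the P, STATES, ACTIONS, GAMMA tables from the MDP sketch above.
def sample_step(s, a):
    """Sample (s', r) from the tabular dynamics p(s', r | s, a)."""
    triples = P[s][a]
    probs = [p for p, _, _ in triples]
    _, s2, r = random.choices(triples, weights=probs)[0]
    return s2, r

def mc_first_visit(n_episodes=2000, horizon=30):
    Q = defaultdict(float)
    returns = defaultdict(list)
    pi = {s: random.choice(ACTIONS) for s in STATES}
    for _ in range(n_episodes):
        # exploring start: every (s, a) pair has probability > 0
        s, a = random.choice(STATES), random.choice(ACTIONS)
        episode = []
        for _ in range(horizon):            # truncated episode: this toy MDP never terminates
            s2, r = sample_step(s, a)
            episode.append((s, a, r))
            s, a = s2, pi[s2]
        G, first_visit = 0.0, {}
        for s, a, r in reversed(episode):   # returns computed backwards; the earliest visit wins
            G = r + GAMMA * G
            first_visit[(s, a)] = G
        for (s, a), g in first_visit.items():
            returns[(s, a)].append(g)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
        for s, _, _ in episode:             # improve π at the states seen in the episode
            pi[s] = max(ACTIONS, key=lambda act: Q[(s, act)])
    return Q, pi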
Sarsa
Sarsa (State-Action-Reward-State-Action) is an on-policy TD control method (see the Temporal Difference section below). The update rule:
    Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]

    Initialise Q(s, a) arbitrarily and Q(terminal-state, ·) = 0
    foreach episode ∈ episodes do
        Initialise s
        Choose a from s using the policy derived from Q (e.g., ε-greedy)
        while s is not terminal do
            Take action a, observe r, s'
            Choose a' from s' using the policy derived from Q (e.g., ε-greedy)
            Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]
            s ← s'
            a ← a'
        end
    end
    Algorithm 4: Sarsa
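A sketch of Algorithm 4 in Python against an assumed minimal environment interface (reset() returning the initial state and step(a) returning (s', r, done)); the interface, the hyperparameters and the ε-greedy helper are illustrative assumptions, not part of the cheat sheet.

import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    """Pick argmax_a Q(s, a) with probability 1 - eps, otherwise a random action."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Algorithm 4 against an assumed env with reset() -> s and step(a) -> (s', r, done)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, actions, eps)
            # on-policy update: the bootstrap uses the action actually selected next
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q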
n-step Sarsa
Define the n-step Q-return:
    q_t^(n) = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n})
n-step Sarsa updates Q(S_t, A_t) towards the n-step Q-return:
    Q(s_t, a_t) ← Q(s_t, a_t) + α [q_t^(n) − Q(s_t, a_t)]

Forward View Sarsa(λ)
The λ-return combines all the n-step Q-returns:
    q_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} q_t^(n)
Forward-view Sarsa(λ):
    Q(s_t, a_t) ← Q(s_t, a_t) + α [q_t^λ − Q(s_t, a_t)]
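To make the two returns concrete, a small sketch that computes q_t^(n) and a truncated q_t^λ from a recorded trajectory; the list layout (rewards[t] holding R_{t+1}, q_values[i] holding Q(S_i, A_i)) and the truncation of the infinite sum at the end of the trajectory are assumptions made for this example.

def n_step_q_return(rewards, q_values, t, n, gamma):
    """q_t^(n): n discounted rewards, then bootstrap with the Q-value at step t + n."""
    g = sum(gamma ** k * rewards[t + k] for k in range(n))   # rewards[t] is R_{t+1}
    return g + gamma ** n * q_values[t + n]

def lambda_q_return(rewards, q_values, t, gamma, lam):
    """q_t^λ = (1 - λ) Σ_n λ^(n-1) q_t^(n), truncated at the end of the trajectory.
    Requires t < len(rewards) and len(q_values) == len(rewards) + 1."""
    N = len(rewards) - t                        # largest n available before truncation
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, N + 1)]
    weights[-1] = lam ** (N - 1)                # give the remaining tail weight to the last return
    return sum(w * n_step_q_return(rewards, q_values, t, n, gamma)
               for w, n in zip(weights, range(1, N + 1)))

# example trajectory of length 3 (all numbers invented)
print(lambda_q_return([1.0, 0.0, 2.0], [0.5, 0.4, 0.3, 0.2], t=0, gamma=0.9, lam=0.8))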
Temporal Difference - Q Learning
Temporal Difference (TD) methods learn directly from raw experience without a model of the environment's dynamics. TD substitutes the expected discounted return G_t from the episode with an estimation:
    V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]    (12)
The following algorithm gives a generic implementation:

    Initialise Q(s, a) arbitrarily and Q(terminal-state, ·) = 0
    foreach episode ∈ episodes do
        Initialise s
        while s is not terminal do
            Choose a from s using the policy derived from Q (e.g., ε-greedy)
            Take action a, observe r, s'
            Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
            s ← s'
        end
    end
    Algorithm 5: Q Learning
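A sketch of Algorithm 5 using the same assumed environment interface as the Sarsa sketch above; note that, unlike Sarsa, the update bootstraps with the greedy action rather than the action actually taken next.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Algorithm 5 against an assumed env with reset() -> s and step(a) -> (s', r, done)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # ε-greedy behaviour policy
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # off-policy update: bootstrap with the greedy action in s'
            best_next = 0.0 if done else max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q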
Deep Q Learning
Created by DeepMind, Deep Q Learning, DQL, substitutes the Q function with a deep neural network called the Q-network. It also keeps track of past observations in a memory in order to use them to train the network (experience replay). The loss minimised at iteration i is:
    L_i(θ_i) = E_{(s, a, r, s') ∼ U(D)} [ ( r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i) )^2 ]    (13)
where r + γ max_{a'} Q(s', a'; θ_{i−1}) is the target and Q(s, a; θ_i) is the prediction. Where θ are the weights of the network and U(D) is the experience replay history.
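A minimal sketch of the loss in equation (13) using PyTorch, which is an assumption: the cheat sheet does not prescribe a framework. The network sizes, the frozen copy playing the role of θ_{i−1}, and the random batch are all illustrative.

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99                       # illustrative sizes
q_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
target_net.load_state_dict(q_net.state_dict())               # frozen copy playing the role of θ_{i-1}

def dqn_loss(batch):
    """batch: (s, a, r, s', done) tensors sampled uniformly from the replay memory D."""
    s, a, r, s2, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # prediction Q(s, a; θ_i)
    with torch.no_grad():                                     # target uses the frozen network
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)

# illustrative random batch of 8 transitions
batch = (torch.randn(8, obs_dim), torch.randint(n_actions, (8,)),
         torch.randn(8), torch.randn(8, obs_dim), torch.zeros(8))
print(dqn_loss(batch))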
The full training procedure:

    Initialise replay memory D with capacity N
    Initialise Q(s, a) arbitrarily
    foreach episode ∈ episodes do
        Initialise s
        while s is not terminal do
            With probability ε select a random action a ∈ A(s),
            otherwise select a = argmax_a Q(s, a; θ)
            Take action a, observe r, s'
            Store transition (s, a, r, s') in D
            Sample a random minibatch of transitions (s_j, a_j, r_j, s'_j) from D
            Set y_j ← r_j                                  for terminal s'_j
                y_j ← r_j + γ max_{a'} Q(s'_j, a'; θ)      for non-terminal s'_j
            Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))^2
            s ← s'
        end
    end
    Algorithm 6: Deep Q Learning
Double Deep Q Learning
Standard DQL tends to overestimate action values because the same max operator both selects and evaluates the greedy action in the target of equation (13). Double Deep Q Learning decouples the two: the online network chooses the action, a* = argmax_{a'} Q(s', a'; θ_i), while the older (target) network evaluates it, giving the target r + γ Q(s', a*; θ_{i−1}).
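For comparison with the target in equation (13), a short sketch of the Double DQL target; it reuses the q_net and target_net defined in the previous sketch and the same batch layout, all of which are illustrative assumptions.

import torch

def double_dqn_target(r, s2, done, gamma=0.99):
    """Assumes q_net and target_net from the DQL loss sketch above."""
    with torch.no_grad():
        best_a = q_net(s2).argmax(dim=1, keepdim=True)        # online net selects a*
        q_eval = target_net(s2).gather(1, best_a).squeeze(1)  # target net evaluates it
        return r + gamma * (1 - done) * q_eval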
Copyright © 2018 Francesco Saverio Zuppichini
https://github.com/FrancescoSaverioZuppichini/ReinforcementLearning-Cheat-Sheet