Reinforcement Learning Cheat Sheet
Agent-Environment Interface
The Agent, at each step t, receives a representation of the environment's state, S_t ∈ S, and selects an action A_t ∈ A(s). Then, as a consequence of its action, the agent receives a reward, R_{t+1} ∈ R ⊂ ℝ.
Policy
A policy is a mapping from a state to an action:
    π_t(a|s)    (1)
That is, the probability of selecting the action A_t = a when S_t = s.
Reward
The total reward is expressed as:
    G_t = Σ_{k=0}^{H} γ^k r_{t+k+1}    (2)
Where γ is the discount factor and H is the horizon, which can be infinite.
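As a quick illustration of equation (2), here is a minimal Python sketch that computes the discounted return of a finite list of rewards; the function name and the example rewards are invented for this illustration and are not part of the cheat sheet.

def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k+1} for a finite horizon H = len(rewards)."""
    g = 0.0
    # iterate backwards so each reward is discounted once per remaining step
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62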
Markov Decision Process
A Markov Decision Process, MDP, is a 5-tuple (S, A, P, R, γ) where:
    finite set of states: s ∈ S
    finite set of actions: a ∈ A
    state transition probabilities: p(s' | s, a) = Pr{S_{t+1} = s' | S_t = s, A_t = a}    (3)
    expected reward for state-action-next-state: r(s', s, a) = E[R_{t+1} | S_{t+1} = s', S_t = s, A_t = a]
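To make the dynamics p(s', r | s, a) concrete, here is one possible tabular encoding of a tiny MDP in Python; the state and action names, transition probabilities and rewards are all invented for illustration. Later sketches in this sheet reuse these P, STATES, ACTIONS and GAMMA names.

# P[s][a] is a list of (probability, next_state, reward) triples,
# i.e. an explicit table of p(s', r | s, a). Values are illustrative only.
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

P = {
    "s0": {"left":  [(1.0, "s0", 0.0)],
           "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"left":  [(1.0, "s0", 0.0)],
           "right": [(1.0, "s1", 2.0)]},
}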
Value Function
The value function describes how good it is to be in a specific state s under a certain policy π. For an MDP:
    v_π(s) = E_π[G_t | S_t = s]    (4)
Informally, it is the expected return (the expected cumulative discounted reward) when starting from s and following π.

Optimal
    v*(s) = max_π v_π(s)    (5)

Action-Value (Q) Function
We can also denote the expected return for state, action pairs:
    q_π(s, a) = E_π[G_t | S_t = s, A_t = a]    (6)

Optimal
Similarly, we can do the same for the Q function. The optimal action-value function:
    q*(s, a) = max_π q_π(s, a)    (7)
Clearly, using this new notation we can redefine v*, equation (5), using q*(s, a), equation (7):
    v*(s) = max_{a ∈ A(s)} q_{π*}(s, a)    (8)
Intuitively, the above equation expresses the fact that the value of a state under the optimal policy must be equal to the expected return from the best action from that state.
Bellman Equation
An important recursive property emerges for both the Value (4) and Q (6) functions if we expand them.

Value function:
    v_π(s) = E_π[G_t | S_t = s]
           = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]
           = E_π[ R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_t = s ]
           = Σ_a π(a|s) Σ_{s'} Σ_r p(s', r | s, a) [ r + γ E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_{t+1} = s' ] ]
           = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]    (9)
Here the sums over s' and r run over all possible next states and rewards, and the inner expectation is the expected return from s_{t+1}.

Action-value function:
    q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
              = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]
              = E_π[ R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_t = s, A_t = a ]
              = Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]    (10)
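As a sanity check on equations (9) and (10), here is a short sketch of a single Bellman backup for one state, reusing the illustrative P, STATES, ACTIONS and GAMMA tables from the MDP sketch above; the uniform policy and the zero-initialised V are arbitrary choices made only for this example.

# Assumes the P, STATES, ACTIONS, GAMMA tables from the MDP sketch above.
V = {s: 0.0 for s in STATES}                                             # current value estimates
policy = {s: {a: 1.0 / len(ACTIONS) for a in ACTIONS} for s in STATES}   # uniform π(a|s)

def q_backup(s, a, V):
    """Equation (10): sum over (s', r) of p(s', r | s, a) * (r + γ V(s'))."""
    return sum(prob * (r + GAMMA * V[s2]) for prob, s2, r in P[s][a])

def v_backup(s, V):
    """Equation (9): sum over a of π(a|s) * q_π(s, a)."""
    return sum(policy[s][a] * q_backup(s, a, V) for a in ACTIONS)

print(v_backup("s0", V))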
Dynamic Programming
Taking advantage of the subproblem structure of the V and Q functions, we can find the optimal policy by just planning.

Policy Iteration
We can now find the optimal policy.

    1. Initialisation
       V(s) ∈ ℝ (e.g. V(s) = 0) and π(s) ∈ A arbitrarily for all s ∈ S
    2. Policy Evaluation
       repeat
           ∆ ← 0
           foreach s ∈ S do
               v ← V(s)
               V(s) ← Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [r + γ V(s')]
               ∆ ← max(∆, |v − V(s)|)
           end
       until ∆ < θ (a small positive number)
    3. Policy Improvement
       policy-stable ← true
       foreach s ∈ S do
           old-action ← π(s)
           π(s) ← argmax_a Σ_{s', r} p(s', r | s, a) [r + γ V(s')]
           if old-action ≠ π(s) then policy-stable ← false
       end
       if policy-stable then return V ≈ v* and π ≈ π*, else go to 2
    Algorithm 1: Policy Iteration
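A compact Python sketch of Algorithm 1 on the illustrative tabular MDP defined earlier; the tolerance θ and the deterministic-policy representation are assumptions made for the example, not prescribed by the cheat sheet.

# Assumes the P, STATES, ACTIONS, GAMMA tables from the MDP sketch above.
def q_value(s, a, V):
    return sum(prob * (r + GAMMA * V[s2]) for prob, s2, r in P[s][a])

def policy_iteration(theta=1e-8):
    V = {s: 0.0 for s in STATES}
    pi = {s: ACTIONS[0] for s in STATES}          # deterministic policy π(s)
    while True:
        # 2. Policy Evaluation for the current deterministic π
        while True:
            delta = 0.0
            for s in STATES:
                v = V[s]
                V[s] = q_value(s, pi[s], V)
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # 3. Policy Improvement
        policy_stable = True
        for s in STATES:
            old_action = pi[s]
            pi[s] = max(ACTIONS, key=lambda a: q_value(s, a, V))
            policy_stable = policy_stable and (old_action == pi[s])
        if policy_stable:
            return V, pi

V_star, pi_star = policy_iteration()
print(pi_star)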
Value Iteration
We can avoid waiting until V(s) has converged and instead do the policy improvement and a truncated policy evaluation step in one operation:

    Initialise V(s) ∈ ℝ, e.g. V(s) = 0
    repeat
        ∆ ← 0
        foreach s ∈ S do
            v ← V(s)
            V(s) ← max_a Σ_{s', r} p(s', r | s, a) [r + γ V(s')]
            ∆ ← max(∆, |v − V(s)|)
        end
    until ∆ < θ (a small positive number)
    Algorithm 2: Value Iteration
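A matching sketch of Algorithm 2 on the same illustrative MDP, again with an assumed tolerance θ; the greedy policy is extracted at the end from the converged values.

# Assumes the P, STATES, ACTIONS, GAMMA tables from the MDP sketch above.
def value_iteration(theta=1e-8):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = V[s]
            V[s] = max(sum(prob * (r + GAMMA * V[s2]) for prob, s2, r in P[s][a])
                       for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # greedy policy extracted from the converged values
    pi = {s: max(ACTIONS,
                 key=lambda a: sum(prob * (r + GAMMA * V[s2]) for prob, s2, r in P[s][a]))
          for s in STATES}
    return V, pi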
Monte Carlo Methods
Monte Carlo (MC) is a Model Free method: it does not require complete knowledge of the environment. It is based on averaging sample returns for each state-action pair. The following algorithm gives the basic implementation:

    Initialise, for all s ∈ S, a ∈ A(s):
        Q(s, a) ← arbitrary
        π(s) ← arbitrary
        Returns(s, a) ← empty list
    while forever do
        Choose S_0 ∈ S and A_0 ∈ A(S_0) such that all pairs have probability > 0
        Generate an episode starting at S_0, A_0 and following π
        foreach pair s, a appearing in the episode do
            G ← return following the first occurrence of s, a
            Append G to Returns(s, a)
            Q(s, a) ← average(Returns(s, a))
        end
        foreach s in the episode do
            π(s) ← argmax_a Q(s, a)
        end
    end
    Algorithm 3: Monte Carlo first-visit

For non-stationary problems, the Monte Carlo estimate for, e.g., V is:
    V(S_t) ← V(S_t) + α [G_t − V(S_t)]    (11)
Where α is the learning rate, i.e. how much we want to forget about past experiences.
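A rough Python sketch of Algorithm 3 with exploring starts on the illustrative MDP above. Because that toy MDP has no terminal state, episodes are truncated at a fixed horizon; the horizon, episode count and helper names are assumptions introduced only for this example.

import random
from collections import defaultdict

# Assumes the P, STATES, ACTIONS, GAMMA tables from the MDP sketch above.
def sample_step(s, a):
    """Sample (s', r) from the tabular dynamics p(s', r | s, a)."""
    triples = P[s][a]
    probs = [p for p, _, _ in triples]
    _, s2, r = random.choices(triples, weights=probs)[0]
    return s2, r

def mc_first_visit(n_episodes=2000, horizon=30):
    Q = defaultdict(float)
    returns = defaultdict(list)
    pi = {s: random.choice(ACTIONS) for s in STATES}
    for _ in range(n_episodes):
        # exploring start: every (s, a) pair has probability > 0
        s, a = random.choice(STATES), random.choice(ACTIONS)
        episode = []
        for _ in range(horizon):            # truncated episode: this toy MDP never terminates
            s2, r = sample_step(s, a)
            episode.append((s, a, r))
            s, a = s2, pi[s2]
        G, first_visit = 0.0, {}
        for s, a, r in reversed(episode):   # returns computed backwards; the earliest visit wins
            G = r + GAMMA * G
            first_visit[(s, a)] = G
        for (s, a), g in first_visit.items():
            returns[(s, a)].append(g)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
        for s, _, _ in episode:             # improve π at the states seen in the episode
            pi[s] = max(ACTIONS, key=lambda act: Q[(s, act)])
    return Q, pi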
Sarsa
Sarsa (State-Action-Reward-State-Action) is an on-policy TD control method (see the Temporal Difference section below). The update rule:
    Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]

    Initialise Q(s, a) arbitrarily and Q(terminal-state, ·) = 0
    foreach episode ∈ episodes do
        Initialise s
        Choose a from s using the policy derived from Q (e.g., ε-greedy)
        while s is not terminal do
            Take action a, observe r, s'
            Choose a' from s' using the policy derived from Q (e.g., ε-greedy)
            Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]
            s ← s'
            a ← a'
        end
    end
    Algorithm 4: Sarsa
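A sketch of Algorithm 4 in Python against an assumed minimal environment interface (reset() returning the initial state and step(a) returning (s', r, done)); the interface, the hyperparameters and the ε-greedy helper are illustrative assumptions, not part of the cheat sheet.

import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    """Pick argmax_a Q(s, a) with probability 1 - eps, otherwise a random action."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Algorithm 4 against an assumed env with reset() -> s and step(a) -> (s', r, done)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, actions, eps)
            # on-policy update: the bootstrap uses the action actually selected next
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q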
n-step Sarsa
Define the n-step Q-return:
    q_t^(n) = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n})
n-step Sarsa updates Q(S_t, A_t) towards the n-step Q-return:
    Q(s_t, a_t) ← Q(s_t, a_t) + α [q_t^(n) − Q(s_t, a_t)]

Forward View Sarsa(λ)
The λ-return combines all the n-step Q-returns:
    q_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} q_t^(n)
Forward-view Sarsa(λ):
    Q(s_t, a_t) ← Q(s_t, a_t) + α [q_t^λ − Q(s_t, a_t)]
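To make the two returns concrete, a small sketch that computes q_t^(n) and a truncated q_t^λ from a recorded trajectory; the list layout (rewards[t] holding R_{t+1}, q_values[i] holding Q(S_i, A_i)) and the truncation of the infinite sum at the end of the trajectory are assumptions made for this example.

def n_step_q_return(rewards, q_values, t, n, gamma):
    """q_t^(n): n discounted rewards, then bootstrap with the Q-value at step t + n."""
    g = sum(gamma ** k * rewards[t + k] for k in range(n))   # rewards[t] is R_{t+1}
    return g + gamma ** n * q_values[t + n]

def lambda_q_return(rewards, q_values, t, gamma, lam):
    """q_t^λ = (1 - λ) Σ_n λ^(n-1) q_t^(n), truncated at the end of the trajectory.
    Requires t < len(rewards) and len(q_values) == len(rewards) + 1."""
    N = len(rewards) - t                        # largest n available before truncation
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, N + 1)]
    weights[-1] = lam ** (N - 1)                # give the remaining tail weight to the last return
    return sum(w * n_step_q_return(rewards, q_values, t, n, gamma)
               for w, n in zip(weights, range(1, N + 1)))

# example trajectory of length 3 (all numbers invented)
print(lambda_q_return([1.0, 0.0, 2.0], [0.5, 0.4, 0.3, 0.2], t=0, gamma=0.9, lam=0.8))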
Temporal Difference - Q Learning
Temporal Difference (TD) methods learn directly from raw experience without a model of the environment's dynamics. TD substitutes the expected discounted return G_t from the episode with an estimation:
    V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]    (12)
The following algorithm gives a generic implementation:

    Initialise Q(s, a) arbitrarily and Q(terminal-state, ·) = 0
    foreach episode ∈ episodes do
        Initialise s
        while s is not terminal do
            Choose a from s using the policy derived from Q (e.g., ε-greedy)
            Take action a, observe r, s'
            Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
            s ← s'
        end
    end
    Algorithm 5: Q Learning
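A sketch of Algorithm 5 using the same assumed environment interface as the Sarsa sketch above; note that, unlike Sarsa, the update bootstraps with the greedy action rather than the action actually taken next.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Algorithm 5 against an assumed env with reset() -> s and step(a) -> (s', r, done)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # ε-greedy behaviour policy
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # off-policy update: bootstrap with the greedy action in s'
            best_next = 0.0 if done else max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q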
Deep Q Learning
Created by DeepMind, Deep Q Learning, DQL, substitutes the Q function with a deep neural network called the Q-network. It also keeps track of past observations in a memory in order to use them to train the network (experience replay). The loss minimised at iteration i is:
    L_i(θ_i) = E_{(s, a, r, s') ∼ U(D)} [ ( r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i) )^2 ]    (13)
where r + γ max_{a'} Q(s', a'; θ_{i−1}) is the target and Q(s, a; θ_i) is the prediction. Where θ are the weights of the network and U(D) is the experience replay history.
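A minimal sketch of the loss in equation (13) using PyTorch, which is an assumption: the cheat sheet does not prescribe a framework. The network sizes, the frozen copy playing the role of θ_{i−1}, and the random batch are all illustrative.

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99                       # illustrative sizes
q_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
target_net.load_state_dict(q_net.state_dict())               # frozen copy playing the role of θ_{i-1}

def dqn_loss(batch):
    """batch: (s, a, r, s', done) tensors sampled uniformly from the replay memory D."""
    s, a, r, s2, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # prediction Q(s, a; θ_i)
    with torch.no_grad():                                     # target uses the frozen network
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)

# illustrative random batch of 8 transitions
batch = (torch.randn(8, obs_dim), torch.randint(n_actions, (8,)),
         torch.randn(8), torch.randn(8, obs_dim), torch.zeros(8))
print(dqn_loss(batch))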
The full training procedure:

    Initialise replay memory D with capacity N
    Initialise Q(s, a) arbitrarily
    foreach episode ∈ episodes do
        Initialise s
        while s is not terminal do
            With probability ε select a random action a ∈ A(s),
            otherwise select a = argmax_a Q(s, a; θ)
            Take action a, observe r, s'
            Store transition (s, a, r, s') in D
            Sample a random minibatch of transitions (s_j, a_j, r_j, s'_j) from D
            Set y_j ← r_j                                  for terminal s'_j
                y_j ← r_j + γ max_{a'} Q(s'_j, a'; θ)      for non-terminal s'_j
            Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))^2
            s ← s'
        end
    end
    Algorithm 6: Deep Q Learning
Double Deep Q Learning
Standard DQL tends to overestimate action values because the same max operator both selects and evaluates the greedy action in the target of equation (13). Double Deep Q Learning decouples the two: the online network chooses the action, a* = argmax_{a'} Q(s', a'; θ_i), while the older (target) network evaluates it, giving the target r + γ Q(s', a*; θ_{i−1}).
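For comparison with the target in equation (13), a short sketch of the Double DQL target; it reuses the q_net and target_net defined in the previous sketch and the same batch layout, all of which are illustrative assumptions.

import torch

def double_dqn_target(r, s2, done, gamma=0.99):
    """Assumes q_net and target_net from the DQL loss sketch above."""
    with torch.no_grad():
        best_a = q_net(s2).argmax(dim=1, keepdim=True)        # online net selects a*
        q_eval = target_net(s2).gather(1, best_a).squeeze(1)  # target net evaluates it
        return r + gamma * (1 - done) * q_eval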
Copyright © 2018 Francesco Saverio Zuppichini
https://github.com/FrancescoSaverioZuppichini/ReinforcementLearning-Cheat-Sheet