
Hello Stack Overflow Community!

I am currently following David Silver's Reinforcement Learning lectures and I am really confused about a point in his "Model-Free Control" slides.

In the slides, Q-Learning is considered off-policy learning, and I could not get the reason behind that. He also mentions that we have both a target policy and a behaviour policy. What is the role of the behaviour policy in Q-Learning?

When I look at the algorithm, it looks quite simple: update your Q(s,a) estimate using the maximum of Q(s',a') over the actions. The slides say "we choose the next action using the behaviour policy", but here we only ever choose the maximizing one.

I am so confused about the Q-Learning algorithm. Can you help me please?

Link to the slides (pages 36-38): http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/control.pdf

test

3 Answers


Check this answer first: https://stats.stackexchange.com/a/184794

[Image: Q-learning algorithm pseudocode]

According to my knowledge, the target policy is what we set as our policy; it could be epsilon-greedy or something else. For the behaviour policy, however, we just use the greedy policy to select the action, without even considering what our target policy is. So it estimates our Q assuming a greedy policy is followed, despite the fact that it is not actually following a greedy policy.
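A minimal tabular sketch of one such step (in Python; the environment interface env.step and all names here are illustrative, not from the slides or the linked answer): an epsilon-greedy policy picks the action that is actually executed, while the update bootstraps with the greedy max over Q(s', .).

    import numpy as np

    def q_learning_step(Q, s, env, rng, epsilon=0.1, alpha=0.1, gamma=0.99):
        # Epsilon-greedy over the current Q estimate picks the executed action.
        if rng.random() < epsilon:
            a = int(rng.integers(Q.shape[1]))   # explore
        else:
            a = int(np.argmax(Q[s]))            # exploit
        s_next, r, done = env.step(a)           # assumed environment interface
        # Bootstrap with the greedy max over Q(s', .), regardless of which
        # action will actually be executed in s'.
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        return s_next, done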

Soroush
  • So, if we update our Q function estimate based on the behaviour policy, then why do we not take the action that maximizes our Q function, but instead update our policy and take the action with some probabilistic approach? What do we gain from this? – test Dec 12 '18 at 15:01
  • We use probabilistic methods like epsilon-greedy in order to have a little bit of exploration, so our agent can get a better understanding of the environment. Also, I think Q-learning uses a behaviour policy in order to converge faster. Let me know if this was helpful to you :) – Soroush Dec 12 '18 at 15:48
  • Thank you so much for the information, it was helpful, but I think I could not express the point I am missing. Let me explain it like this: we take actions based on our policy, which is the target policy, and our goal here is to find the optimal Q function under the optimal policy. What is the meaning of following another policy, the target policy, while taking actions, if we use the behaviour policy during the update? I know we use epsilon-greedy improvement because of the exploration problem, but if we use the behaviour policy to find the optimal Q, why do we follow the target policy? – test Dec 12 '18 at 16:07
  • I think the red text should be exchanged. – roachsinai Jul 31 '19 at 12:27
  • @roachsinai How come? As far as I know, the target policy is the one we define (to be greedy), and the behaviour policy is the one we update our value function with respect to (without considering our actual policy). – Soroush Aug 01 '19 at 13:18
  • Hi @Soroush, I am not sure, but what you posted should all be the target policy. The behaviour policy just creates the episodes. – roachsinai Aug 01 '19 at 13:29

Ideally, you want to learn the true Q-function, i.e., the one that satisfies the Bellman equation

Q(s,a) = R(s,a) + gamma*E[Q(s',a')]   for all s,a

where the expectation is over a', with respect to the policy.

First, we approximate the problem and get rid of the "for all" because we have access to only a few samples (especially in continuous action spaces, where the "for all" results in infinitely many constraints). Second, say you want to learn a deterministic policy (if there is an optimal policy, there is a deterministic optimal policy). Then the expectation disappears, but you need to collect samples somehow. This is where the "behavior" policy comes in, which usually is just a noisy version of the policy you want to optimize (the most common choices are epsilon-greedy, or adding Gaussian noise if the action is continuous).
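For concreteness, here is a rough sketch of those two common choices (toy Python functions of my own, not a specific library's API):

    import numpy as np

    rng = np.random.default_rng(0)

    # Discrete actions: the behavior policy is an epsilon-greedy version of the
    # greedy target policy defined by Q.
    def behavior_discrete(Q_row, epsilon=0.1):
        if rng.random() < epsilon:
            return int(rng.integers(len(Q_row)))   # exploratory random action
        return int(np.argmax(Q_row))               # otherwise act like the target

    # Continuous actions: the behavior policy is the deterministic target policy
    # plus Gaussian exploration noise.
    def behavior_continuous(pi, s, sigma=0.1):
        a = np.asarray(pi(s), dtype=float)
        return a + rng.normal(0.0, sigma, size=a.shape)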

So now you have samples collected with a behavior policy, and a (deterministic) target policy that you want to optimize. The resulting equation is

Q(s,a) = R(s,a) + gamma*Q(s',pi(s'))

The difference between the two sides is the TD error, and you want to minimize it given samples collected from the behavior policy:

min E[R(s,a) + gamma*Q(s',pi(s')) - Q(s,a)] 

where the expectation is approximated with samples (s,a,s') collected using the behavior policy.
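As a sketch of that approximation (illustrative names of my own: Q is a callable estimate and pi is the deterministic target policy), the sampled TD errors over a batch of transitions collected with the behavior policy look like this; in practice one minimizes their squared average:

    import numpy as np

    def td_errors(Q, pi, batch, gamma=0.99):
        # batch: transitions (s, a, r, s_next) collected with the behavior policy.
        # For each one, the target policy pi supplies the bootstrap action in s_next.
        return np.array([r + gamma * Q(s_next, pi(s_next)) - Q(s, a)
                         for (s, a, r, s_next) in batch])

    # Empirical objective approximating the expectation above:
    # loss = np.mean(td_errors(Q, pi, batch) ** 2)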

If we consider Soroush's pseudocode with discrete actions, then pi(s') = argmax_A Q(s',A), so Q(s',pi(s')) = max_A Q(s',A), and the update rule follows from the derivative of the TD(0) error.

These are some good easy reads to learn more about TD: 1, 2, 3, 4.


EDIT

Just to underline the difference between on- and off-policy: SARSA is on-policy, because the TD error used to update the policy is

min E[R(s,a) + gamma*Q(s',a') - Q(s,a)] 

where a' is the action actually collected while sampling data with the behavior policy, not pi(s') (the action that the target policy would choose in state s').
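A side-by-side sketch of the two bootstrap targets in the tabular case (names are mine, not from the lecture): SARSA plugs in the a' that the behavior policy actually executed, while Q-learning plugs in the greedy action of the target policy.

    import numpy as np

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        # On-policy: a_next is the action the behavior policy actually took in s'.
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # Off-policy: bootstrap with the greedy target, pi(s') = argmax_a Q(s', a),
        # no matter which action the behavior policy executes next.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])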

Simon
  • So, you are saying that we are trying to build the target policy by using the behavior policy. We follow some episodes on the system and update the behavior policy, and while updating it we also update our target policy to minimize the difference between them. We are trying to get the target policy but we are following the behavior policy, so Q-learning is off-policy learning? – test Dec 17 '18 at 13:22
  • Yes and no. Yes: we update target policy by using the behavior policy. No: we don't update the behavior and we don't minimize the difference between target and behavior. The behavior just adds noise to the target, and the noise is not updated (you may define it to decrease with the learning iteration, but you don't learn it). – Simon Dec 17 '18 at 13:44
  • Oh, okay. Thank you so much :) – test Dec 18 '18 at 14:04

@Soroush's answer is only right if the red text is exchanged. Off-policy learning means you try to learn the optimal policy $\pi$ using trajectories sampled from another policy or policies. This means $\pi$ is not used to generate actual actions that are being executed in the environment. Since A is the executed action from the $\epsilon$-greedy algorithm, it is not from $\pi$ (the target policy) but another policy (the behavior policy, hence the name "behavior").

Weizi Li