5

I cannot understand what the fundamental difference is between on-policy methods (like A3C) and off-policy methods (like DDPG). As far as I know, off-policy methods can learn the optimal policy regardless of the behavior policy: they can learn by observing any trajectory in the environment. Therefore, can I say that off-policy methods are better than on-policy methods?

I have read the cliff-walking example showing the difference between SARSA and Q-learning. It says that Q-learning would learn the optimal policy of walking along the cliff, while SARSA would learn to choose a safer path when using an epsilon-greedy policy. But since Q-learning has already told us the optimal policy, why don't we just follow that policy instead of continuing to explore?

Also, are there situations in which one of the two kinds of method is better than the other? In which cases would one prefer on-policy algorithms?

DarkZero

1 Answer

8

As you already said, off-policy methods can learn the optimal policy regardless of the behaviour policy (strictly speaking, the behaviour policy must satisfy some conditions, such as sufficient exploration of all state-action pairs), while on-policy methods require that the agent acts according to the very policy that is being learnt.

Imagine a situation where you have a data set of trajectories (i.e., data in the form of tuples (s, a, r, s')) that was stored beforehand. This data has been collected by applying a given policy, and you cannot change that policy. In this case, which is common in medical problems, you can only apply off-policy methods.
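To make the setting concrete, here is a minimal sketch of off-policy learning from such a fixed dataset: a Q-learning sweep over stored (s, a, r, s') tuples, with no further interaction with the environment. The tiny MDP, the random behaviour policy and all hyperparameters below are made up purely for illustration.

```python
import random
from collections import defaultdict

# Toy illustration: batch Q-learning on a fixed set of (s, a, r, s') tuples
# collected beforehand by an arbitrary behaviour policy (here: uniform random).
random.seed(0)
n_states, actions = 5, [0, 1]

def toy_step(s, a):
    """Hypothetical dynamics, used only to generate the stored dataset."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next

# Collect the dataset once with the behaviour policy; we cannot change it later.
dataset, s = [], 0
for _ in range(1000):
    a = random.choice(actions)
    r, s_next = toy_step(s, a)
    dataset.append((s, a, r, s_next))
    s = 0 if s_next == n_states - 1 else s_next

# Off-policy learning: repeatedly sweep the stored tuples with the Q-learning update.
Q = defaultdict(float)
alpha, gamma = 0.1, 0.95
for _ in range(50):
    for s, a, r, s_next in dataset:
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

# Greedy policy extracted from the learnt Q-values.
print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states)})
```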

Does this mean that off-policy methods are better? Not necessarily. We can say that off-policy methods are more flexible in the type of problems they can face. However, from a theoretical point of view, on-policy methods have different properties that are sometimes more convenient. For instance, if we compare Q-learning with SARSA, a key difference between them is the max operator used in the Q-learning update rule. This operator is highly non-linear, which can make it more difficult to combine the algorithm with function approximators.
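To make that difference concrete, here is a sketch of the two tabular update rules (the dictionary Q and the argument names are my own choices for the example). SARSA bootstraps from the action the behaviour policy actually takes next, while Q-learning bootstraps from the max over actions, independently of the behaviour policy.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy target: uses the action a_next actually chosen by the current
    # (e.g. epsilon-greedy) policy in s_next.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # Off-policy target: the max over all actions in s_next; this max is the
    # non-linear operator mentioned above.
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```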

When is it better to use on-policy methods? Well, if you are facing a problem with a continuous state space and you are interested in using a linear function approximator (an RBF network, for instance), then on-policy methods tend to be more stable. You can find more information on this topic in the section on off-policy bootstrapping in Sutton and Barto's book.
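As a rough illustration of that setup (a minimal sketch under my own assumptions, not the book's exact algorithm), here is semi-gradient SARSA with a linear approximator over RBF features on a made-up 1-D continuous state space; the dynamics, reward and hyperparameters are invented for the example.

```python
import numpy as np

# Semi-gradient SARSA with a linear function approximator over RBF features.
rng = np.random.default_rng(0)
centers = np.linspace(0.0, 1.0, 10)        # RBF centres spread over the state interval
sigma, actions = 0.1, [-1, +1]
w = np.zeros((len(actions), len(centers)))  # one weight vector per action

def features(s):
    return np.exp(-((s - centers) ** 2) / (2 * sigma ** 2))

def q(s, a_idx):
    return w[a_idx] @ features(s)

def epsilon_greedy(s, eps=0.1):
    if rng.random() < eps:
        return int(rng.integers(len(actions)))
    return int(np.argmax([q(s, i) for i in range(len(actions))]))

alpha, gamma = 0.05, 0.99
for episode in range(200):
    s = rng.random()                        # start somewhere in [0, 1)
    a = epsilon_greedy(s)
    for t in range(100):
        # Toy dynamics: drifting towards 1.0 is rewarded; the episode ends at the edge.
        s_next = np.clip(s + 0.05 * actions[a] + 0.01 * rng.standard_normal(), 0.0, 1.0)
        r = 1.0 if s_next >= 1.0 else 0.0
        done = s_next >= 1.0
        a_next = epsilon_greedy(s_next)
        # Semi-gradient SARSA: the target uses the action actually taken in s_next.
        target = r if done else r + gamma * q(s_next, a_next)
        w[a] += alpha * (target - q(s, a)) * features(s)
        if done:
            break
        s, a = s_next, a_next
```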

Pablo EM
  • Thank you! So Q-learning is harder to combine with linear function approximators, but now that DQN has been invented and tricks like experience replay have made it stable, is that still a drawback? – DarkZero Mar 06 '17 at 08:50
  • Experience replay is actually quite old, but yes, this kind of trick made Q-learning more stable. I cannot say that there is a "bad part", only that the methods have different theoretical properties, and sometimes these properties play a key role in their practical application. Currently, and especially in the deep learning field, Q-learning is likely more popular and widely used. – Pablo EM Mar 06 '17 at 09:04
  • Can we say that on-policy methods often converge faster, or are more stable, compared to off-policy ones? So they can be used when we want to find a fairly good policy in a short time? – DarkZero Mar 06 '17 at 09:35
  • In general, on-policy methods often converge faster, but this does not mean that they are computationally less demanding. Also, they provide stronger theoretical convergence properties when combined with linear function approximators. A more specific comparison would depend on the particular algorithms considered and on the problem at hand. – Pablo EM Mar 06 '17 at 11:35
  • @DarkZero, please let me know if you have additional doubts. Otherwise, if the answer was useful, don't forget to accept it. – Pablo EM Mar 07 '17 at 09:55
  • Sorry for the late comment. So can we say that, by default, we can choose off-policy methods? It seems rare to face a problem with a continuous state space, and few people want to use linear approximators now; they all prefer neural networks... – DarkZero Mar 07 '17 at 11:41
  • Well, "rare" and "few people" depends very much on the ambit. But if you are working in a kind of applications where linear approximators are unusual, yes off-policy methods can be the default option. On the other hand, continuous state-space means that the state space if defined by continuous variables, which happens very often. For example in a simply cart-pole problem, where angle and agular velocity are continuous variables. – Pablo EM Mar 07 '17 at 12:07
  • Oh, I know the cart-pole example. So in practice, we often have to try both to determine which is the better choice for the problem, right? – DarkZero Mar 07 '17 at 13:16
  • Yes, in general it is difficult to know in advance which exact algorithm will perform better than the rest. However, if you have wide experience with the domain and/or the algorithms, you will probably have a feeling for which kind of algorithm better fits the requirements of the problem. – Pablo EM Mar 07 '17 at 13:51
  • Ok, I got it. Thank you for all the discussions! – DarkZero Mar 08 '17 at 02:20