I cannot understand what the fundamental difference between on-policy methods (like A3C) and off-policy methods (like DDPG) is. As far as I know, off-policy methods can learn the optimal policy regardless of the behavior policy; they can learn by observing any trajectory in the environment. Therefore, can I say that off-policy methods are better than on-policy methods?
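Just to be explicit about the terms I'm using: my understanding is that the behavior policy $\mu$ is the one that generates the trajectories, and the target policy $\pi$ is the one being evaluated and improved. On-policy methods require $\pi = \mu$, while off-policy methods allow $\pi \neq \mu$ (e.g., learning the greedy policy from $\epsilon$-greedy experience).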
I have read the cliff-walking example showing the difference between SARSA and Q-learning. It says that Q-learning would learn the optimal policy of walking right along the cliff, while SARSA would learn to choose a safer path when using an epsilon-greedy policy. But since Q-learning has already told us the optimal policy, why don't we just follow that policy instead of continuing to explore?
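For concreteness, the two update rules as I understand them (following Sutton and Barto) are

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big] \quad \text{(SARSA)}$$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big] \quad \text{(Q-learning)}$$

where in SARSA $a_{t+1}$ is the action actually taken by the $\epsilon$-greedy behavior policy, whereas Q-learning bootstraps from the greedy action regardless of which action is actually executed next.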
Also, are there situations where one of the two kinds of methods is better than the other? In which cases would one prefer on-policy algorithms?