Given a deterministic environment (or, as you say, a "perfect" environment in which you know the resulting state after performing an action), you can simulate the effect of all possible actions in a given state (i.e., compute all possible next states) and choose the action that leads to the next state with the maximum value V(state), as sketched below.
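For instance, a minimal sketch of that one-step lookahead in Python (the `next_state`, `reward` and `V` names here are hypothetical stand-ins for your known transition model, reward function and value estimates):

```python
def greedy_action(state, actions, next_state, V, reward, gamma=0.99):
    """Pick the action whose deterministic successor state maximizes
    the one-step return r + gamma * V(s')."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        s_next = next_state(state, a)                      # known deterministic transition
        value = reward(state, a, s_next) + gamma * V[s_next]
        if value > best_value:
            best_action, best_value = a, value
    return best_action
```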
However, it should be taken into account that both the value function V(state) and the Q function Q(state, action) are defined for a given policy. In a sense, the value function is an average of the Q function over the actions selected by the policy: V(s) "evaluates" the state s by taking all possible actions into account. So, to compute a good estimate of V(s), the agent still needs to try all the possible actions in s.
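Concretely, for a stochastic policy this relation is just V(s) = sum over a of pi(a|s) * Q(s, a). A tiny illustrative sketch (again with hypothetical `policy` and `Q` objects, not any particular library's API):

```python
def state_value_from_q(state, actions, policy, Q):
    """Compute V(s) as the policy-weighted average of Q(s, a)."""
    return sum(policy(a, state) * Q[(state, a)] for a in actions)
```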
In conclusion, I think that although V(s) is simpler than Q(s,a), they likely need a similar amount of experience (or time) to reach a stable estimate.
You can find more info about value (V and Q) functions in this section of the Sutton & Barto RL book.