
The update rule for TD(0) Q-learning is:

Q(t-1) = (1 - alpha) * Q(t-1) + alpha * ( Reward(t-1) + gamma * Max( Q(t) ) )
Then take either the current best action (to exploit) or a random action (to explore).

where Max( Q(t) ) is the maximum Q-value that can be obtained in the next state.
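As a concrete sketch of that update (the env.reset() / env.step() interface and the tabular Q array are only assumptions for illustration), one episode would look roughly like this:

```python
import numpy as np

def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1):
    # One episode of tabular TD(0) Q-learning with an epsilon-greedy
    # behaviour policy. Q is assumed to be a (num_states, num_actions) array.
    s = env.reset()
    done = False
    while not done:
        # Explore with probability epsilon, otherwise take the current best action.
        if np.random.rand() < epsilon:
            a = np.random.randint(Q.shape[1])
        else:
            a = int(np.argmax(Q[s]))

        s_next, r, done = env.step(a)

        # TD(0) target: Reward(t-1) + gamma * Max( Q(t) ), with no bootstrap
        # at terminal states.
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

        s = s_next
    return Q
```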


But in TD(1), I think the update rule will be:

Q(t-2) = (1 - alpha) * Q(t-2) + alpha * ( Reward(t-2) + gamma * Reward(t-1) + gamma * gamma * Max( Q(t) ) )
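To make that concrete, here is a minimal sketch of the two-step backup I mean, against the same assumed tabular setup (two_step_update and its argument names are only illustrative):

```python
import numpy as np

def two_step_update(Q, s, a, r0, r1, s2, alpha=0.1, gamma=0.99):
    # Two-step return: Reward(t-2) + gamma * Reward(t-1) + gamma^2 * Max( Q(t) ).
    # r0 is the reward after taking a in s, and r1 is the reward after the
    # *following* action, whatever the behaviour policy chose there.
    G = r0 + gamma * r1 + gamma * gamma * np.max(Q[s2])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * G
    return Q
```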

My question:
The term gamma * Reward(t-1) seems to imply that I must always take my best action at t-1, which I think would prevent exploring.
Can someone give me a hint?

Thanks

Betamoo

1 Answer


You are talking about using "eligibility traces", right? See the equations and the algorithm.

Notice the e_t(s, a) equation there. No penalty is applied when using an exploration step.
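For illustration, here is a rough sketch of Watkins's Q(lambda), i.e. Q-learning with eligibility traces (the environment interface and the epsilon_greedy helper are assumed, not part of any particular library):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon):
    # Behaviour policy: random action with probability epsilon, else greedy.
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def watkins_q_lambda_episode(env, Q, alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
    E = np.zeros_like(Q)                    # eligibility traces e_t(s, a)
    s = env.reset()
    a = epsilon_greedy(Q, s, epsilon)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, epsilon) if not done else None
        a_star = int(np.argmax(Q[s_next]))  # greedy action in the next state

        delta = r + (0.0 if done else gamma * Q[s_next, a_star]) - Q[s, a]
        E[s, a] += 1.0                      # accumulating trace for (s, a)
        Q += alpha * delta * E              # credit all recently visited pairs

        if done or a_next == a_star:
            E *= gamma * lam                # decay traces after a greedy step
        else:
            E[:] = 0.0                      # cut traces after an exploratory step
        s, a = s_next, a_next
    return Q
```

Cutting the traces to zero after an exploratory action is what addresses the concern in the question: multi-step credit only flows back through greedy steps, so earlier state-action pairs are not penalised for what happens after an exploration step, and exploration remains free to happen.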

Ivo Danihelka