I've read on Wikipedia (https://en.wikipedia.org/wiki/Q-learning):
"Q-learning may suffer from slow rate of convergence, especially when the discount factor γ is close to one.[16] Speedy Q-learning, a new variant of Q-learning algorithm, deals with this problem and achieves a slightly better rate of convergence than model-based methods such as value iteration."
So I wanted to try Speedy Q-learning and see how much better it is.
The only source about it I could find online is this paper: https://papers.nips.cc/paper/4251-speedy-q-learning.pdf
That's the algorithm they suggest.
Now, I didn't understand it: what exactly is T_k Q_k? Am I supposed to keep another list of Q-values? Is there any clearer explanation than this?
Q[previousState][action] += learningRate * (reward + discountFactor * maxNextExpectedReward - Q[previousState][action]);
This is my current Q-learning update, and I want to replace it with Speedy Q-learning.
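For context, here is my attempt at translating the paper's update rule into code, in case it helps pin down what I'm misunderstanding. My reading is that T_k Q(x,a) = r + γ·max_b Q(y,b) is the empirical Bellman operator applied to a Q-table using the one observed transition (x,a) → (r,y), so T_k Q_k and T_k Q_{k-1} would be that operator applied to the current and previous Q-tables, which would mean I do need a second table. The paper describes a synchronous sweep over all state-action pairs with α_k = 1/(k+1); I've adapted it to the online, one-transition-at-a-time setting here, and the per-entry copy into qPrev and the global counter k are my own guesses, so this may not be exactly what the authors analyze:

// My best-guess sketch of Speedy Q-learning, adapted to the online
// setting. Environment names (nStates, nActions, state/action/reward)
// are made up for illustration.
public class SpeedyQLearning {
    final int nStates, nActions;
    final double discountFactor;
    double[][] qPrev;  // Q_{k-1}: the previous Q estimate
    double[][] q;      // Q_k:     the current Q estimate
    long k = 0;        // iteration counter, used for alpha_k = 1/(k+1)

    SpeedyQLearning(int nStates, int nActions, double discountFactor) {
        this.nStates = nStates;
        this.nActions = nActions;
        this.discountFactor = discountFactor;
        this.qPrev = new double[nStates][nActions];
        this.q = new double[nStates][nActions];
    }

    // Empirical Bellman operator T_k applied to a Q-table, using the
    // single observed transition (reward, nextState):
    //   T_k Q(x,a) = r + gamma * max_b Q(nextState, b)
    double bellman(double[][] table, double reward, int nextState) {
        double best = table[nextState][0];
        for (int b = 1; b < nActions; b++) {
            best = Math.max(best, table[nextState][b]);
        }
        return reward + discountFactor * best;
    }

    void update(int state, int action, double reward, int nextState) {
        double alpha = 1.0 / (k + 1);  // alpha_k from the paper; a global
                                       // counter is my assumption here
        double tQPrev = bellman(qPrev, reward, nextState); // T_k Q_{k-1}(x,a)
        double tQ     = bellman(q,     reward, nextState); // T_k Q_k(x,a)

        double qOld = q[state][action];
        // The update rule as I read it from the paper:
        // Q_{k+1} = Q_k + alpha*(T_k Q_{k-1} - Q_k)
        //               + (1 - alpha)*(T_k Q_k - T_k Q_{k-1})
        double qNew = qOld + alpha * (tQPrev - qOld)
                           + (1 - alpha) * (tQ - tQPrev);

        qPrev[state][action] = qOld;  // current estimate becomes the previous one
        q[state][action] = qNew;
        k++;
    }
}

If someone could confirm whether this two-table reading of T_k Q_k is right (or point out where it goes wrong), that would answer most of my question.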