I have 'successfully' set up a Q-network for solving the 'FrozenLake-v0' env of the OpenAI gym (at least, I think... not 100% sure how to score this: I get 70 to 80 successful episodes out of 100 after 5k episodes of training without experience replay). I'm still quite new to these kinds of programming problems, but I do have several years of overall programming experience. I use the latest gym, Python 3.6.4 (x64) and Tensorflow 1.7. I have yet to set up Tensorflow-GPU on my 980 Ti rig at home (which, from what I've read, will blow my CPU out of the water).
Now, I'm trying to improve by implementing experience replay: every step (= one 'experience') is saved as a (s, a, r, s') tuple: state, action, reward, new state. After a minimum number of pre_train_steps has been taken (in other words, after a certain number of 'organic' steps), every 25 steps (if total_steps % 25 == 0) I sample 4 random episodes from memory (memory being the last 1000 episodes), and from each of those 4 episodes I sample 4 random but consecutive steps (episode[n:n+4] where n = rand(0, len(episode) + 1 - 4)).
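For reference, here is a minimal sketch of that sampling scheme, assuming memory is a deque of completed episodes, each episode being a list of (s, a, r, s1) tuples. The names (memory, pre_train_steps, train_freq, sample_episodes, trace_length, sample_batch) are just placeholders I'm using here, not necessarily what's in my linked code:

```python
import random
from collections import deque

memory = deque(maxlen=1000)   # the last 1000 completed episodes
pre_train_steps = 50000       # 'organic' steps before training starts (gates the main loop, not shown)
train_freq = 25               # train every 25 steps (also used in the main loop)
sample_episodes = 4           # episodes drawn per training step
trace_length = 4              # consecutive steps drawn from each episode

def sample_batch():
    """Return up to 4*4 = 16 (s, a, r, s1) tuples."""
    batch = []
    # only episodes long enough to supply a full trace
    eligible = [ep for ep in memory if len(ep) >= trace_length]
    for episode in random.sample(eligible, min(sample_episodes, len(eligible))):
        n = random.randint(0, len(episode) - trace_length)  # random start index
        batch.extend(episode[n:n + trace_length])
    return batch
```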
The result is 4*4 = 16 (s, a, r, s') tuples as samples. For each of these samples I get Q(s, a), Q(s', a') and max(Q(s', a')). Then I calculate the target Q-values, setting targetQ(:, a) = r + gamma * max(Q(s', a')) with gamma = .99 for each of the samples. I then train using a GradientDescentOptimizer(learning_rate=0.1) with the loss function defined as loss = reduce_sum(square(targetQ - Q)).
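In code, the update looks roughly like the sketch below, assuming the usual one-hot state input and single linear layer for FrozenLake (16 states x 4 actions). The variable names (inputs, W, Q_out, target_Q, train_on_batch) are illustrative; the actual graph is in the pastebin link at the bottom:

```python
import numpy as np
import tensorflow as tf

gamma = 0.99

inputs = tf.placeholder(shape=[None, 16], dtype=tf.float32)   # one-hot encoded states
W = tf.Variable(tf.random_uniform([16, 4], 0, 0.01))
Q_out = tf.matmul(inputs, W)                                   # Q(s, .) for the whole batch

target_Q = tf.placeholder(shape=[None, 4], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(target_Q - Q_out))
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

def one_hot(state):
    return np.identity(16)[state:state + 1]

def train_on_batch(sess, batch):
    """batch: list of (s, a, r, s1) tuples sampled from replay memory."""
    states = np.vstack([one_hot(s) for s, _, _, _ in batch])
    next_states = np.vstack([one_hot(s1) for _, _, _, s1 in batch])
    q = sess.run(Q_out, feed_dict={inputs: states})             # Q(s, a) for all actions
    q_next = sess.run(Q_out, feed_dict={inputs: next_states})   # Q(s', a') for all actions
    targets = q.copy()
    for i, (s, a, r, s1) in enumerate(batch):
        targets[i, a] = r + gamma * np.max(q_next[i])           # Bellman target for the action taken
    sess.run(train_op, feed_dict={inputs: states, target_Q: targets})
```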
Testing without experience replay, running 10k episodes (approx. 290k 'organic' steps) with all parameters (gamma, learning rate, etc.) equal to those above, I consistently get 70-80 successful episodes per 100 tested. This takes about 9 minutes to run on my Lenovo T440s laptop.
With experience replay enabled, however, running 10k episodes (approx. 240k 'organic' and 115k 'trained' steps) with pre_train_steps = 50k and train_freq = 25, the results are consistently lower (65-70 successful per 100 episodes), while taking slightly less time (approx. 8 minutes) on the same T440s.
Why? Am I expecting too much from experience replay? I thought it would cut down my training time and increase my accuracy (especially by preventing the network from getting 'locked' into only choosing certain paths), but it hardly seems to help at all. Maybe my code is wrong, or maybe I'm using the wrong parameters? It would be a great help if someone could look this over and point me in the right direction, as I'd like to keep increasing the complexity of my network(s) and try out different environments, but before I do I want to know I'm not doing something completely wrong...
TIA!
Full code: https://pastebin.com/XQU2Tx18