
I have 'successfully' set up a Q-network for solving the 'FrozenLake-v0' env of the OpenAI gym (at least, I think... not 100% sure how well I score - I get 70 to 80 successful episodes out of 100 after 5k episodes of training without Experience Replay). I'm still quite new to these kinds of programming problems, but I do have several years of overall programming experience. I use the latest gym, Python 3.6.4 (x64) and TensorFlow 1.7. I have yet to set up TensorFlow-GPU on my 980 Ti rig at home (which, from what I've read, will blow my CPU out of the water).

Now, I'm trying to improve by implementing experience replay: every step (= one 'experience') is saved as (s, a, r, s'): state, action, reward, new state. After a minimum number of pre_train_steps has been taken (in other words: once a certain number of 'organic' steps has been taken), every 25 steps (if total_steps % 25 == 0) I sample 4 random episodes from memory (memory being the last 1000 episodes), and for each of those 4 episodes I sample 4 random, consecutive steps within that episode (episode[n:n+4] where n = rand(0, len(episode) + 1 - 4)).
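A minimal sketch of that sampling scheme, purely for illustration (the names memory, n_episodes and trace_len are my own stand-ins for whatever the full code uses; memory is assumed to hold the last 1000 episodes, each a list of (s, a, r, s') tuples):

import random

def sample_batch(memory, n_episodes=4, trace_len=4):
    # Sample 4 random episodes, then 4 consecutive steps from each.
    samples = []
    # Only episodes long enough to contain a full trace are eligible.
    eligible = [ep for ep in memory if len(ep) >= trace_len]
    for episode in random.sample(eligible, n_episodes):
        start = random.randint(0, len(episode) - trace_len)  # n = rand(0, len(episode) + 1 - 4)
        samples.extend(episode[start:start + trace_len])      # episode[n:n+4]
    return samples  # 4 * 4 = 16 (s, a, r, s') tuples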

The result is 4*4 = 16 (s, a, r, s') tuples as samples. For each of these samples, I get Q(s, a), Q(s', a') and max(Q(s', a')). Then I calculate target Q-values, setting targetQ(:, a) = r + gamma * max(Q(s', a')) with gamma = 0.99, for each of the samples. I then train using a GradientDescentOptimizer(learning_rate=0.1) and a loss function defined as loss = reduce_sum(square(targetQ - Q)).
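In NumPy terms, the target construction described above might look roughly like this (a sketch only; q_s and q_s_prime are assumed to hold the network's Q-value outputs for the sampled states s and s'):

import numpy as np

gamma = 0.99

def build_targets(batch, q_s, q_s_prime):
    # batch: list of 16 (s, a, r, s') tuples; q_s, q_s_prime: arrays of shape [16, n_actions]
    targets = np.array(q_s)                 # start from Q(s, .) so untouched actions keep their values
    max_q_next = np.max(q_s_prime, axis=1)  # max(Q(s', a')) per sample
    for i, (s, a, r, s_next) in enumerate(batch):
        targets[i][int(a)] = r + gamma * max_q_next[i]
    return targets

# The loss is then reduce_sum(square(targets - Q)), minimised with
# GradientDescentOptimizer(learning_rate=0.1) as described above.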

Testing without experience replay, running 10k episodes (approx. 290k 'organic' steps) with all parameters (gamma, LR, etc.) equal to those written above, I get consistent results of 70-80 successes per 100 episodes tested. This takes about 9 minutes to run on my Lenovo T440s laptop.

Enabling experience replay, however, and running 10k episodes (approx. 240k 'organic' and 115k 'trained' steps) with pre_train_steps = 50k and train_freq = 25, results are consistently lower (65-70 successes per 100 episodes), and the run takes slightly less time (approx. 8 mins) on my old T440s.

Why? Am I expecting too much from this Experience Replay thing? I thought it would cut down my training time and increase my accuracy (especially by preventing the network from getting 'locked' into only choosing certain paths), but it hardly seems to help at all. Maybe my code is wrong, maybe I'm using the wrong parameters? It would be a great help if someone could look this over and point me in the right direction, as I'd like to keep increasing the complexity of my network(s) and try out different environments, but before I do I want to know I'm not doing something completely wrong...

TIA!

Full code: https://pastebin.com/XQU2Tx18

Floris

1 Answer


By inspecting the code in your link, I get the impression that:

  • e is the epsilon parameter of the epsilon-greedy strategy
  • batch_train appears to be the parameter that decides whether or not to use Experience Replay?

Assuming the above is correct, one thing that stands out to me is this block of code:

for i, experience in enumerate(training_batch):
    s, a, r, ss, d = experience  # state, action, reward, next state, done

    # Epsilon is decayed here, i.e. once per *replayed* sample with reward 1
    if int(r) == 1:
        e -= e_factor

        if e < e_end:
            e = e_end

    # Bellman target: targetQ(s, a) = r + gamma * max_a' Q(s', a')
    target_Qs[i][int(a)] = r + QN1.gamma * new_Qs_max[i]

which is inside the if-block conditioned on batch_train == True; in other words, the snippet above only runs in the case where Experience Replay is used. That code appears to be decaying your epsilon parameter.

Generally, you don't want epsilon to decay based on the number of samples of experience that you've learned from; you want it to decay based on the number of actions you've actually taken in the environment. This should be independent of whether or not you're using Experience Replay. So, one possible explanation is that you're simply decaying epsilon too quickly in the case where Experience Replay is used.
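For example, tying the decay to environment steps could look roughly like the sketch below (the loop structure, the decay constants and the placeholder choose_action are assumptions of mine, not the code from your link):

import gym

env = gym.make('FrozenLake-v0')
e, e_factor, e_end = 1.0, 1e-4, 0.1      # illustrative values only

replay_buffer = []

def choose_action(state, epsilon):
    # Placeholder: in the real code this would be epsilon-greedy over the Q-network.
    return env.action_space.sample()

for episode in range(100):
    s = env.reset()
    done = False
    while not done:
        a = choose_action(s, e)
        s_next, r, done, _ = env.step(a)
        replay_buffer.append((s, a, r, s_next))
        e = max(e_end, e - e_factor)     # decay once per action taken, independent of ER
        s = s_next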

Apart from that, it looks like you're still also performing direct learning steps on your most recent samples of experience, in addition to learning from older samples through Experience Replay. It is much more common to only learn from samples randomly taken out of your Experience Replay buffer, and not learn at all directly from the most recent samples. It is also much more common to simply store (s, a, r, s') tuples in the replay buffer independent of which episode they came from, and then take learning steps much more regularly than once every 25 actions.
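A sketch of that more common setup, with a flat buffer of individual transitions and a learning step on every time step (the buffer capacity and batch size are illustrative, not prescriptions):

import random
from collections import deque

# Flat buffer of individual (s, a, r, s') transitions; episode boundaries are ignored.
replay_buffer = deque(maxlen=100000)

def store(s, a, r, s_next):
    replay_buffer.append((s, a, r, s_next))

def sample_minibatch(batch_size=16):
    # Uniform random sample across all stored transitions.
    return random.sample(replay_buffer, batch_size)

# In the training loop: once enough transitions have been stored, call
# sample_minibatch() and take one gradient step on it every time step,
# rather than once every 25 steps.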

Those are all differences between the more common implementation and your implementation that you could look into, but intuitively I don't expect them to explain the reduction in performance you're observing. Due to your differences with respect to the more common implementation, you're simply still much closer to the "no experience replay" setting, so I'd really expect you to just get very similar performance to that setting instead of worse performance.

Dennis Soemers
  • Hi Dennis, thank you for the rapid reply! Your initial impressions are correct. I set `batch_train` manually to compare ER-on to ER-off runs. Note that `epsilon` is not only decayed in ER but also during direct learning. `epsilon` decays based on successful episodes. I will test changes to my code based on your suggestions and report back! By `It is much more common to only learn from samples randomly taken`, do you mean that I shouldn't train my network with direct learning but only use direct learning as a source for my ER buffer? – Floris Apr 04 '18 at 12:54
  • @Floris Yes I noticed epsilon also decays outside of ER, that's no problem. The problem is that epsilon generally decays over time, where time is measured according to the amount of experience collected / actions taken. In your case, you measure "time" by the number of learning steps taken, which increases once you turn on ER. As for your second question, yes, that is the most common approach; simply store your samples of experience in the buffer as you observe them, and always take learning steps only on samples from the replay buffer (typically done every time step instead of every 25 time steps) – Dennis Soemers Apr 04 '18 at 13:13
  • I've made some changes to the code: https://pastebin.com/GActGcGk Direct learning is now only used to add new experiences. After `pre_train_steps`, every `train_freq` steps (still tuning this) we train the network on 16 random samples from the buffer. `epsilon` decay is now calculated every action, with a smaller step to prevent it from decaying too quickly. Another change I've made is to the random_uniform range, from [0, 1] to [0, 0.01]; I was getting some Q-values > 1 at the end, which seemed off. Initial results do show improvement vs the earlier version of the code... More testing to do! – Floris Apr 04 '18 at 15:19