
I am trying to train a DQN agent to solve OpenAI Gym's CartPole-v0 environment. I started from this person's implementation just to get some hands-on experience. What I noticed is that during training, after many episodes, the agent finds a solution and is able to keep the pole upright for the maximum number of timesteps. However, with further training the policy appears to become more stochastic: the agent can no longer keep the pole upright reliably and drifts in and out of a good policy. I'm pretty confused by this: why wouldn't further training and experience help the agent? By these later episodes my epsilon for random actions has become very low, so the agent should be acting almost entirely on its own predictions. So why does it fail to keep the pole upright on some training episodes and succeed on others?

Here is a plot of my reward-per-episode curve during training of the implementation linked above.

[Plot: reward per episode during training]

alex

1 Answer


This actually looks fairly normal to me, in fact I guessed your results were from CartPole before reading the whole question.

I have a few suggestions:

  • When you're plotting results, you should plot averages over a few random seeds (see the first sketch after this list). Not only is this generally good practice (it shows how sensitive your algorithm is to the seed), it also smooths out your graphs and gives you a better picture of the "skill" of your agent. Don't forget that both the environment and the policy are stochastic, so it's not completely crazy that your agent exhibits this type of behavior.
  • Assuming you're using ε-greedy exploration, what's your epsilon value? Are you decaying it over time? The issue could also be that your agent is still exploring a lot even after it has found a good policy (see the second sketch after this list).
  • Have you played around with hyperparameters, like learning rate, epsilon, network size, replay buffer size, etc? Those can also be the culprit.
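For the first point, here is a minimal sketch of what I mean by averaging over seeds. It assumes a hypothetical `train_dqn(seed)` function standing in for whatever training loop you already have, returning a list of per-episode rewards; the seed list and smoothing window are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_mean_reward(train_dqn, seeds=(0, 1, 2, 3, 4), window=50):
    # One reward curve per seed; all runs are assumed to have the same number of episodes.
    curves = np.array([train_dqn(seed) for seed in seeds])
    mean = curves.mean(axis=0)
    std = curves.std(axis=0)

    # Simple moving average to smooth out episode-to-episode noise.
    kernel = np.ones(window) / window
    smoothed = np.convolve(mean, kernel, mode="valid")

    episodes = np.arange(len(smoothed))
    plt.plot(episodes, smoothed, label=f"mean over {len(seeds)} seeds")
    plt.fill_between(episodes,
                     smoothed - std[:len(smoothed)],
                     smoothed + std[:len(smoothed)],
                     alpha=0.3)
    plt.xlabel("episode")
    plt.ylabel("reward (moving average)")
    plt.legend()
    plt.show()
```

The shaded band gives you a feel for how much of the wobble in your single-run curve is just seed-to-seed noise rather than the agent actually getting worse.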
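For the exploration point, one common ε-greedy schedule looks roughly like the sketch below. `q_network` and `num_actions` are placeholders for your own model and environment, and the start/decay/floor values are purely illustrative; the floor (a minimum epsilon) is a standard trick, not something from your linked implementation.

```python
import random
import numpy as np

EPS_START = 1.0
EPS_DECAY = 0.995   # multiplied in once per episode
EPS_MIN = 0.01      # floor: keep a little exploration even late in training

epsilon = EPS_START

def select_action(state, q_network, num_actions):
    # With probability epsilon take a random action, otherwise act greedily.
    if random.random() < epsilon:
        return random.randrange(num_actions)            # explore
    q_values = q_network.predict(state[np.newaxis])     # placeholder for your model's forward pass
    return int(np.argmax(q_values[0]))                  # exploit

def end_of_episode():
    global epsilon
    epsilon = max(EPS_MIN, epsilon * EPS_DECAY)          # never decay below the floor
```

Printing epsilon alongside the episode reward makes it easy to see whether the dips in performance line up with exploration or happen even when the agent is acting almost purely greedily.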
harwiltz
  • Hello - okay, good to hear that this looks normal. I start with epsilon = 1, use a decay factor of 0.995, and I don't limit how small epsilon can get, hoping that at some point the agent doesn't need to take random actions anymore. I just checked my implementation: in the later episodes (> 4500), epsilon is 1.4e-11, which is really small. I haven't played around much with the other parameters you mention - I'm hoping to gain a theoretical understanding so I can work out how the parameters should be set. Is that even possible? Or is the common practice to find them experimentally? – alex Jun 23 '20 at 13:57
  • You'll probably just need to experiment, unfortunately; hyperparameter tuning in RL is notoriously rough – harwiltz Jun 23 '20 at 17:27
  • so I was actually able to find the term for this - it's called [catastrophic interference](https://en.wikipedia.org/wiki/Catastrophic_interference#:~:text=Catastrophic%20interference%2C%20also%20known%20as,connectionist%20approach%20to%20cognitive%20science.) – alex Jun 24 '20 at 14:00