
I am training a reinforcement learning agent on an episodic task with a fixed episode length. I am tracking the training process by plotting the cumulative reward per episode in TensorBoard. I have trained my agent for 20M steps, so I believe it has had enough time to train. The cumulative reward for an episode can range from +132 to around -60. Here is my plot with a smoothing of 0.999:

[plot: cumulative episode reward, smoothing = 0.999]

Over the episodes, I can see that my rewards have converged. But if I look at the same plot with a smoothing of 0:

[plot: cumulative episode reward, smoothing = 0]

There is a huge variation in the rewards. So should I consider that the agent has converged or not? Also, I don't understand why there is such a huge variation in rewards even after so much training.
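For reference, TensorBoard's smoothing slider applies an exponential moving average to the raw points, so a weight of 0.999 effectively averages over roughly a thousand episodes and can make noisy rewards look flat. A minimal sketch (simplified; TensorBoard additionally debiases the early points) with illustrative, made-up reward values:

```python
import random

def ema_smooth(values, weight):
    """Exponential moving average, as TensorBoard's smoothing slider
    applies to scalar plots (debiasing of early steps omitted)."""
    smoothed = []
    last = values[0]
    for v in values:
        last = last * weight + (1 - weight) * v
        smoothed.append(last)
    return smoothed

# Hypothetical noisy per-episode rewards around a converged mean
random.seed(0)
rewards = [40 + random.uniform(-100, 92) for _ in range(20000)]

raw_spread = max(rewards) - min(rewards)
smooth = ema_smooth(rewards, 0.999)
# Skip the warm-up region where the EMA is still catching up
smoothed_spread = max(smooth[5000:]) - min(smooth[5000:])
print(raw_spread, smoothed_spread)  # the smoothed spread is far smaller
```

With weight 0.999 the variance of the smoothed curve is a tiny fraction of the raw variance, which is why the 0.999 plot looks converged while the raw plot still swings between the extremes.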

Thanks.

chink
  • What task is the agent trying to solve? – nsidn98 Nov 29 '19 at 15:31
  • It is a control problem with episodic tasks of 9 hrs. The agent tries to maintain the temperature in a room by taking an action every 15 mins. If the action maintains the temperature in the required range, the agent gets a positive reward; if the action takes the temperature out of range, the agent gets a negative reward based on how bad the deviation is. – chink Dec 02 '19 at 06:25
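Based on that description, the reward might look something like the sketch below. The temperature bounds, the +1 in-range reward, and the linear penalty are all illustrative assumptions, not details from the question:

```python
def temperature_reward(temp, low=20.0, high=24.0):
    """Hypothetical reward matching the comment above: positive while the
    temperature stays in range, negative in proportion to how far outside
    the range it drifts. Bounds and scale are assumed for illustration."""
    if low <= temp <= high:
        return 1.0
    # Penalty grows with the distance from the nearest bound
    nearest_bound = low if temp < low else high
    return -abs(temp - nearest_bound)

print(temperature_reward(22.0))  # in range -> 1.0
print(temperature_reward(27.5))  # 3.5 degrees above range -> -3.5
```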

0 Answers