
I am training a reinforcement learning agent on an episodic task with a fixed episode length. I am tracking the training process by plotting the cumulative reward per episode in TensorBoard. I have trained my agent for 20M steps, so I believe it has had enough time to train. The cumulative reward for an episode can range from +132 to around -60. Here is my plot with a smoothing of 0.999:

[plot: cumulative episode reward, smoothing = 0.999]

Over the episodes, I can see that my rewards have converged. But if I look at the same plot with a smoothing of 0:

[plot: cumulative episode reward, smoothing = 0]

There is a huge variation in the rewards. So should I consider that the agent has converged or not? Also, I don't understand why there is such a huge variation in rewards even after so much training.
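For reference, TensorBoard's smoothing slider applies an exponential moving average to the raw points, so a weight of 0.999 effectively averages over roughly a thousand episodes and can make noisy rewards look flat. A minimal sketch (simplified; TensorBoard additionally debiases the early points) with illustrative, made-up reward values:

```python
import random

def ema_smooth(values, weight):
    """Exponential moving average, as TensorBoard's smoothing slider
    applies to scalar plots (debiasing of early steps omitted)."""
    smoothed = []
    last = values[0]
    for v in values:
        last = last * weight + (1 - weight) * v
        smoothed.append(last)
    return smoothed

# Hypothetical noisy per-episode rewards around a converged mean
random.seed(0)
rewards = [40 + random.uniform(-100, 92) for _ in range(20000)]

raw_spread = max(rewards) - min(rewards)
smooth = ema_smooth(rewards, 0.999)
# Skip the warm-up region where the EMA is still catching up
smoothed_spread = max(smooth[5000:]) - min(smooth[5000:])
print(raw_spread, smoothed_spread)  # the smoothed spread is far smaller
```

With weight 0.999 the variance of the smoothed curve is a tiny fraction of the raw variance, which is why the 0.999 plot looks converged while the raw plot still swings between the extremes.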

Thanks.

chink
  • What task is the agent trying to solve? – nsidn98 Nov 29 '19 at 15:31
  • It is a control problem with episodic tasks of 9 hrs. The agent tries to maintain the temperature in a room by taking an action every 15 mins. If the action maintains the temperature in the required range, the agent gets a positive reward; if the action takes the temperature out of range, the agent gets a negative reward based on how bad the deviation is. – chink Dec 02 '19 at 06:25
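Based on that description, the reward might look something like the sketch below. The temperature bounds, the +1 in-range reward, and the linear penalty are all illustrative assumptions, not details from the question:

```python
def temperature_reward(temp, low=20.0, high=24.0):
    """Hypothetical reward matching the comment above: positive while the
    temperature stays in range, negative in proportion to how far outside
    the range it drifts. Bounds and scale are assumed for illustration."""
    if low <= temp <= high:
        return 1.0
    # Penalty grows with the distance from the nearest bound
    nearest_bound = low if temp < low else high
    return -abs(temp - nearest_bound)

print(temperature_reward(22.0))  # in range -> 1.0
print(temperature_reward(27.5))  # 3.5 degrees above range -> -3.5
```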

0 Answers