Hi, I'm training reinforcement learning agents for a control problem using the PPO algorithm. I track the accumulated reward for each episode during training. Several times during training I see a sudden dip in the accumulated rewards, and I can't figure out why this happens or how to avoid it. I've tried changing some hyperparameters (number of neurons in the network layers, learning rate, etc.), but the dips still show up consistently. When I debug and inspect the actions taken during a dip, they are clearly very bad, which explains the drop in reward but not why the policy suddenly degrades.
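For reference, here is a minimal sketch (not my actual training code) of how I compute the per-episode return and a moving average over it. The reward data here is synthetic, just to illustrate the shape of the dips I'm describing; even the smoothed curve shows them:

```python
import numpy as np

def moving_average(returns, window=10):
    """Smooth a sequence of episode returns with a simple moving average."""
    returns = np.asarray(returns, dtype=float)
    if len(returns) < window:
        return returns
    kernel = np.ones(window) / window
    # 'valid' mode: each output point averages exactly `window` episodes
    return np.convolve(returns, kernel, mode="valid")

# Synthetic curve: steadily improving returns with one sudden dip,
# similar in shape to what I see during training.
episode_returns = list(range(50)) + [5] * 5 + list(range(50, 100))
smoothed = moving_average(episode_returns, window=10)
print(smoothed.min())  # the dip survives smoothing
```

Even with smoothing, the dips are too deep to be averaging noise, so I assume the policy itself is briefly collapsing.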
Can someone help me understand why this is happening or how to avoid it?
Here are some plots from my training process: