I have taken some reference implementations of the PPO algorithm and am trying to create an agent that can play Space Invaders. Unfortunately, from the second trial onwards (i.e., after the actor and critic networks have been trained for the first time), the probability distribution over actions collapses onto a single action, and both the PPO loss and the critic loss converge to a single value.
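To make the symptom concrete, the check that shows the collapse looks roughly like this (a hypothetical sketch with a made-up network and batch, not my actual logging code):

```python
import torch
import torch.nn as nn

# Dummy stand-ins for the real actor and a batch of observations
# (assumed shapes: flattened 84x84x4 Atari frames, 6 Space Invaders actions).
actor = nn.Sequential(nn.Linear(84 * 84 * 4, 256), nn.ReLU(), nn.Linear(256, 6))
observations = torch.randn(32, 84 * 84 * 4)

with torch.no_grad():
    probs = torch.softmax(actor(observations), dim=-1)           # [batch, n_actions]
    entropy = -(probs * torch.log(probs + 1e-8)).sum(-1).mean()
    # With the real trained actor, a collapsed policy shows one entry near 1.0
    # and the rest near 0, and the policy entropy drops towards 0.
    print("mean action probs:", probs.mean(dim=0))
    print("policy entropy:", entropy.item())
```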
I wanted to understand the probable reasons why this might occur. I can't keep running the code on my cloud VMs without being sure that I'm not missing anything, as the VMs are very costly to use. I would appreciate any help or advice on this; if required, I can post the code as well. The hyperparameters used are as follows:
clipping_val = 0.2
critic_discount = 0.5
entropy_beta = 0.001
gamma = 0.99
lambda = 0.95
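For reference, this is roughly how I understand these hyperparameters to be combined into the total loss in the implementations I am following (a minimal PyTorch-style sketch with assumed function names and tensor shapes, not my actual code):

```python
import torch
import torch.nn.functional as F

clipping_val = 0.2
critic_discount = 0.5
entropy_beta = 0.001

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns, entropy):
    """Clipped PPO surrogate combined with the critic loss and an entropy bonus.
    All arguments except `entropy` are 1-D tensors over a batch of timesteps;
    `entropy` is the mean policy entropy of the batch."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clipping_val, 1.0 + clipping_val) * advantages
    actor_loss = -torch.min(surr1, surr2).mean()
    critic_loss = F.mse_loss(values, returns)
    # The entropy term (scaled by entropy_beta) is what discourages the policy
    # from collapsing onto a single action too early.
    return actor_loss + critic_discount * critic_loss - entropy_beta * entropy
```

gamma and lambda are only used upstream of this, when computing the discounted returns and advantages.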