I have taken some reference implementations of the PPO algorithm and am trying to create an agent that can play Space Invaders. Unfortunately, from the second trial onwards (i.e., after the actor and critic networks have been trained for the first time), the probability distribution over actions collapses onto a single action, and both the PPO loss and the critic loss converge to a single value.
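To make the symptom concrete, the check that shows the collapse looks roughly like this (a hypothetical sketch with a made-up network and batch, not my actual logging code):

```python
import torch
import torch.nn as nn

# Dummy stand-ins for the real actor and a batch of observations
# (assumed shapes: flattened 84x84x4 Atari frames, 6 Space Invaders actions).
actor = nn.Sequential(nn.Linear(84 * 84 * 4, 256), nn.ReLU(), nn.Linear(256, 6))
observations = torch.randn(32, 84 * 84 * 4)

with torch.no_grad():
    probs = torch.softmax(actor(observations), dim=-1)           # [batch, n_actions]
    entropy = -(probs * torch.log(probs + 1e-8)).sum(-1).mean()
    # With the real trained actor, a collapsed policy shows one entry near 1.0
    # and the rest near 0, and the policy entropy drops towards 0.
    print("mean action probs:", probs.mean(dim=0))
    print("policy entropy:", entropy.item())
```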
I wanted to understand the probable reasons why this might occur. I can't keep running the code on my cloud VMs without being sure that I'm not missing anything, as the VMs are very costly to use. I would appreciate any help or advice on this; if required, I can post the code as well. The hyperparameters used are as follows:
clipping_val = 0.2
critic_discount = 0.5
entropy_beta = 0.001
gamma = 0.99
lambda = 0.95
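For reference, this is roughly how I understand these hyperparameters to be combined into the total loss in the implementations I am following (a minimal PyTorch-style sketch with assumed function names and tensor shapes, not my actual code):

```python
import torch
import torch.nn.functional as F

clipping_val = 0.2
critic_discount = 0.5
entropy_beta = 0.001

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns, entropy):
    """Clipped PPO surrogate combined with the critic loss and an entropy bonus.
    All arguments except `entropy` are 1-D tensors over a batch of timesteps;
    `entropy` is the mean policy entropy of the batch."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clipping_val, 1.0 + clipping_val) * advantages
    actor_loss = -torch.min(surr1, surr2).mean()
    critic_loss = F.mse_loss(values, returns)
    # The entropy term (scaled by entropy_beta) is what discourages the policy
    # from collapsing onto a single action too early.
    return actor_loss + critic_discount * critic_loss - entropy_beta * entropy
```

gamma and lambda are only used upstream of this, when computing the discounted returns and advantages.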