
I have a DQN algorithm that learns (the loss converges to 0), but unfortunately it learns a Q-value function in which the Q values for the 2 possible actions are very similar. It is also worth noting that the Q values change very little from one observation to the next.

Details:

  • The algorithm plays CartPole-v1 from OpenAI Gym but uses the screen pixels as an observation rather than the 4 values provided

  • The reward function I have defined gives a reward of 0.1 on every step where the game is not over and -1 on game over

  • The discount factor (gamma) is 0.95

  • epsilon is 1 for the first 3200 actions (to populate some of the replay memory) and then annealed over 100,000 steps to the value of 0.01

  • the replay memory is of size 10,000

  • The architecture of the conv net is as follows (a rough Keras sketch is given after this list):

    • input layer of size screen_pixels
    • conv layer 1 with 32 filters, kernel (8,8), stride (4,4), ReLU activation and 'same' padding
    • conv layer 2 with 64 filters, kernel (4,4), stride (2,2), ReLU activation and 'same' padding
    • conv layer 3 with 64 filters, kernel (3,3), stride (1,1), ReLU activation and 'same' padding
    • a flatten layer (to reshape the data so it can feed into a fully connected layer)
    • Fully connected layer with 512 nodes and relu activation function
    • An output fully connected layer with 2 nodes (the action space)
  • The learning rate of the convolutional neural network is 0.0001
  • The code has been developed in Keras and uses experience replay and double deep Q-learning
  • The original image is reduced from (400, 600, 3) to (60, 84, 4) by greyscaling, resizing, cropping and then stacking 4 frames together before feeding it to the conv net (a sketch of this preprocessing is also given after this list)
  • The target network is updated every 2 online network updates.
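For reference, here is a rough sketch of the preprocessing pipeline described above. The use of OpenCV, the crop region and the scaling to [0, 1] are assumptions; only the input shape (400, 600, 3), the output shape (60, 84, 4) and the 4-frame stacking come from the question.

```python
import numpy as np
import cv2
from collections import deque

def preprocess_frame(rgb_frame):
    """Convert one (400, 600, 3) RGB screen capture into a (60, 84) grey frame."""
    grey = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)               # greyscale
    grey = grey[100:300, :]                                          # crop (assumed region of interest)
    grey = cv2.resize(grey, (84, 60), interpolation=cv2.INTER_AREA)  # cv2 takes (width, height)
    return grey.astype(np.float32) / 255.0                           # scale to [0, 1] (assumption)

class FrameStacker:
    """Keeps the last 4 preprocessed frames and stacks them into a (60, 84, 4) observation."""
    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        frame = preprocess_frame(first_frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return np.stack(self.frames, axis=-1)

    def step(self, new_frame):
        self.frames.append(preprocess_frame(new_frame))
        return np.stack(self.frames, axis=-1)
```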
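A minimal Keras sketch of the network described in the list above. Only the layer sizes, 'same' padding, ReLU activations and the 0.0001 learning rate come from the question; the Adam optimizer and Huber loss are assumptions.

```python
from tensorflow.keras import layers, losses, models, optimizers

def build_q_network(input_shape=(60, 84, 4), num_actions=2, learning_rate=1e-4):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (8, 8), strides=(4, 4), padding='same', activation='relu'),
        layers.Conv2D(64, (4, 4), strides=(2, 2), padding='same', activation='relu'),
        layers.Conv2D(64, (3, 3), strides=(1, 1), padding='same', activation='relu'),
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dense(num_actions, activation='linear'),  # one Q value per action
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=learning_rate),
                  loss=losses.Huber())  # loss choice is an assumption
    return model

online_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(online_net.get_weights())  # target network starts as a copy
```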

1 Answer


Providing a positive reward of 0.1 on every step while the game is not over may make the -1 game-over punishment almost irrelevant, particularly given the discount factor you are using.
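To make that concrete, here is a quick back-of-the-envelope calculation using the standard discounted-return formula. Gamma = 0.95 and the 0.1/-1 rewards come from the question; the 12- and 30-step episode lengths come from the comments below, and 100 steps is added just for illustration.

```python
# Discounted return of an episode that pays +0.1 per step and -1 at termination.
def discounted_return(steps, gamma=0.95, step_reward=0.1, terminal_reward=-1.0):
    ret = sum(step_reward * gamma**t for t in range(steps))
    return ret + terminal_reward * gamma**steps

for steps in (12, 30, 100):
    print(steps, round(discounted_return(steps), 3))
# 12 steps  -> ~0.38
# 30 steps  -> ~1.36
# 100 steps -> ~1.98 (approaches 0.1 / (1 - 0.95) = 2)
```

Even at 12 steps the accumulated positive rewards (about 0.92 after discounting) already outweigh the discounted penalty (about -0.54), and as episodes get longer the terminal punishment contributes less and less to the value the network is asked to learn.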

It is difficult to judge without looking at your source code, but I would initially suggest providing only a negative reward at the end of the game and removing the positive step rewards.
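A minimal sketch of that change inside a standard Gym step loop. The names (`env`, `action`, `state`, `replay_memory`, `preprocess`) are placeholders, not the asker's code; the asker observes screen pixels, so the state returned by `env.step` is discarded.

```python
# Hypothetical step loop: 0 reward while the pole is up, -1 on game over.
_, _, done, _ = env.step(action)                         # discard Gym's built-in +1 reward and 4-value state
reward = -1.0 if done else 0.0                           # only punish the terminal transition
next_state = preprocess(env.render(mode='rgb_array'))    # placeholder for the pixel preprocessing
replay_memory.append((state, action, reward, next_state, done))
state = next_state
```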

Juan Leni
  • Thank you for your answer. I can see how this could be a big problem in the long run and your suggestion could solve problems further down the line (in terms of learning). But at the moment, the algorithm tends to only take around 12 actions per episode (never more than 30). The Q values themselves are rather confusing given the reward function; they tend to be in the range -1.8 < q_values < -1. I do not understand how the value could ever become smaller than -1 given the reward function, but perhaps this is a clue to the problem. – MichaelAndroidNewbie Aug 03 '17 at 00:46
  • I could clean and upload my code to a public github repository or paste it all into my question if you would like to look, but it is around 250 lines of code – MichaelAndroidNewbie Aug 03 '17 at 00:47
  • When updating the Q value you are bootstrapping, i.e. using an estimate from the neural network. At the very beginning, the values generated by the network can be much smaller than -1. It might take some time until the network provides proper estimates (see the sketch after these comments). – Juan Leni Aug 03 '17 at 00:51
  • But the losses are so small that, at least to my understanding, the network has learnt the values of those particular states but they are erroneous values. I should mention that the algorithm had been run with different settings (experience replay, loss function etc.) on two different machines, overnight, and had performed around 80,000 actions each (I understand that learning a good policy can take much longer than this, but it is confusing behaviour nonetheless). – MichaelAndroidNewbie Aug 03 '17 at 01:09
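For reference, a rough sketch of the bootstrapped double-DQN target the comment above refers to. The variable names, the batch layout and storing `dones` as 0/1 floats are assumptions; only gamma = 0.95 and the online/target network pair come from the question. Before the networks have learned anything, the bootstrap term is essentially noise, which is why targets (and hence Q values) below -1 can appear even though no single reward is below -1.

```python
import numpy as np

def double_dqn_targets(online_model, target_model, batch, gamma=0.95):
    """Compute double-DQN regression targets for a batch of transitions (sketch)."""
    states, actions, rewards, next_states, dones = batch  # actions: int array, dones: 0/1 floats

    # Online network selects the next action, target network evaluates it.
    next_q_online = online_model.predict(next_states, verbose=0)
    next_actions = np.argmax(next_q_online, axis=1)
    next_q_target = target_model.predict(next_states, verbose=0)
    bootstrap = next_q_target[np.arange(len(next_actions)), next_actions]

    # Until the networks are trained, `bootstrap` is arbitrary, so
    # r + gamma * bootstrap can sit well below -1 on non-terminal steps.
    targets = online_model.predict(states, verbose=0)
    targets[np.arange(len(actions)), actions] = rewards + gamma * bootstrap * (1.0 - dones)
    return targets
```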