
I have some trouble with my implementation of a deep neural network for the game Pong: my network always diverges, regardless of which parameters I change. I took a Pong game and implemented a Theano/Lasagne-based deep Q-learning algorithm based on the famous Nature paper by Google DeepMind.

What I want:
Instead of feeding the network with pixel data, I want to input the x- and y-position of the ball and the y-position of the paddle for 4 consecutive frames, so I get a total of 12 inputs.
I only want to reward hits, losses, and wins of a round.
With this configuration, the network did not converge and my agent was not able to play the game. Instead, the paddle moved straight to the top or bottom, or kept repeating the same pattern. So I thought I would try to make it a bit easier for the agent and add some information.

What I did:
States:

  • x-position of the ball (-1 to 1)
  • y-position of the ball (-1 to 1)
  • normalized x-velocity of the ball
  • normalized y-velocity of the ball
  • y-position of the paddle (-1 to 1)

With 4 consecutive frames, I get a total of 20 inputs.
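
For illustration, a rough sketch of how such a stacked state vector could be built (the names frame_features and build_state are just placeholders, not the ones in my code):

```python
import numpy as np
from collections import deque

FRAME_HISTORY = 4  # number of consecutive frames stacked into one state

def frame_features(ball_x, ball_y, ball_vx, ball_vy, paddle_y):
    # positions already scaled to [-1, 1], velocities normalized
    return np.array([ball_x, ball_y, ball_vx, ball_vy, paddle_y], dtype=np.float32)

history = deque(maxlen=FRAME_HISTORY)

def build_state(features):
    # pad with the current frame at the start of an episode
    history.append(features)
    while len(history) < FRAME_HISTORY:
        history.append(features)
    # concatenate the last 4 frames into one flat 20-dimensional input
    return np.concatenate(history)
```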

Rewards:

  • +10 if the paddle hits the ball
  • +100 if the agent wins the round
  • -100 if the agent loses the round
  • -5 to 0 depending on the distance between the predicted end position (y-position) of the ball and the current y-position of the paddle
  • +20 if the predicted end position of the ball lies within the current range of the paddle (a hit is foreseeable)
  • -5 if the ball is behind the paddle (a hit is no longer possible; see the sketch below)
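
A rough sketch of how such a per-step reward could be computed (helper arguments like predicted_end_y and paddle_half_height are placeholders, not the exact names in my code):

```python
def compute_reward(hit, won, lost, ball_behind_paddle,
                   predicted_end_y, paddle_y, paddle_half_height,
                   max_distance=2.0):
    # Illustrative reward matching the scheme above; all names are placeholders.
    reward = 0.0
    if hit:
        reward += 10.0
    if won:
        reward += 100.0
    if lost:
        reward -= 100.0
    distance = abs(predicted_end_y - paddle_y)
    # -5 to 0 depending on how far the paddle is from the predicted end position
    reward -= 5.0 * min(distance / max_distance, 1.0)
    if distance <= paddle_half_height:
        reward += 20.0  # a hit is foreseeable
    if ball_behind_paddle:
        reward -= 5.0   # no hit possible anymore
    return reward
```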

With this configuration, the network still diverges. I tried varying the learning rate (0.1 to 0.00001), the number of nodes in the hidden layers (5 to 500), the number of hidden layers (1 to 4), the batch accumulator (sum or mean), and the update rule (RMSProp or DeepMind's RMSProp variant).
None of these led to a satisfactory solution. The graph of the loss averages mostly looks something like this. You can download my current version of the implementation here.
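
For reference, a minimal sketch of how such a Q-network could be set up in Theano/Lasagne; the layer sizes and learning rate are placeholders, not my exact settings:

```python
import theano
import theano.tensor as T
import lasagne

states = T.matrix('states')    # shape: (batch_size, 20)
targets = T.matrix('targets')  # target Q-values, shape: (batch_size, 3)

# one hidden layer; sizes are placeholders
network = lasagne.layers.InputLayer(shape=(None, 20), input_var=states)
network = lasagne.layers.DenseLayer(
    network, num_units=80, nonlinearity=lasagne.nonlinearities.rectify)
network = lasagne.layers.DenseLayer(
    network, num_units=3, nonlinearity=None)  # linear Q-values for up/stay/down

q_values = lasagne.layers.get_output(network)
loss = lasagne.objectives.squared_error(q_values, targets).mean()

params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.rmsprop(loss, params, learning_rate=0.00025)

train_fn = theano.function([states, targets], loss, updates=updates)
predict_fn = theano.function([states], q_values)
```
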
I would be very grateful for any hint :)
Koanashi

chron0x
  • Since I do not have enough reputation points to post more than two links, I am providing them here: [Pong-Game](http://pygame.org/project-py-pong-2040-.html); [Theano/Lasagne implementation](https://github.com/spragunr/deep_q_rl); [Nature paper](https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf); [Another loss plot](http://i.imgur.com/U5ZBLcQ.png); – chron0x Sep 07 '16 at 13:28
  • have you tried using lower reward values? If possible, I'd recommend trying to normalize all rewards to lie in [0.0, 1.0] or [-1.0, 1.0] based on the minimum and maximum possible values using the rewards you are currently using. If it is difficult to determine those minimum and maximum values, it may still at least help to bring everything closer to 0 (maybe divide all the rewards you're using right now by 100?). This may help the network to converge more quickly. – Dennis Soemers Sep 07 '16 at 16:57
  • Thanks, I haven't tried that yet, but I will now, and then report back. – chron0x Sep 07 '16 at 18:20
  • Wow! That's it! @DennisSoemers thank you very much! The network converges and the right player learns to play Pong. I will now try to adjust the game so that I get the conditions I want. Really great!!! :) – chron0x Sep 08 '16 at 06:23

1 Answer


Repeating my suggestion from the comments as an answer now to make it easier to see for anyone else ending up on this page later (it was posted as a comment first since I was not 100% sure it would be the solution):

Reducing the magnitude of the rewards to lie in (or at least close to) the [0.0, 1.0] or [-1.0, 1.0] intervals helps the network to converge more quickly.

Changing the reward values in this way (simply dividing them all by a number so that they lie in a smaller interval) does not change what a network is able to learn in theory. The network could also learn the same concepts with larger rewards, simply by finding larger weights throughout the network.
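
For example, a quick sketch using the reward values from the question (the exact divisor is only an illustration):

```python
# Illustration only: rescale the question's rewards into [-1, 1]
# by dividing by the largest magnitude (here 100).
raw_rewards = {
    'hit': 10.0,
    'win': 100.0,
    'loss': -100.0,
    'predictable_hit': 20.0,
    'ball_behind_paddle': -5.0,
}

scale = max(abs(r) for r in raw_rewards.values())  # 100.0
scaled_rewards = {name: r / scale for name, r in raw_rewards.items()}
# -> {'hit': 0.1, 'win': 1.0, 'loss': -1.0, 'predictable_hit': 0.2, 'ball_behind_paddle': -0.05}
```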

However, learning such large weights typically takes much more time. The main reason is that weights are usually initialized to random values close to 0, so it takes a lot of training to push those values up to large magnitudes. Because the initial weights are small and therefore far away from the optimal weight values, there is also an increased risk of getting stuck in a local (rather than global) minimum along the way.

With lower reward values, the optimal weight values are likely to be low in magnitude as well. This means that weights initialized to small random values are already more likely to be close to their optimal values. This leads to a shorter training time (less "distance" to travel to put it informally), and a decreased risk of there being local minima along the way to get stuck in.

Dennis Soemers
  • After the first version worked fine, I removed the reward that is based on the distance between the predicted end position and the paddle. Now my [losses](http://i.imgur.com/6U3zKlT.png) first decrease but then increase again. With 4 consecutive frames, I still have 20 inputs. I chose one hidden layer with 80 nodes and I have 3 outputs (up, stay, down). Did I run into overfitting, or do I need more than one hidden layer? Or why are the losses increasing again? – chron0x Sep 09 '16 at 06:25
  • In a standard supervised learning setting (training the neural network on example situations where the optimal output is known), I'd recommend not only plotting the loss on the validation/test data, but also the loss on the training data. Then, if the loss on the training data keeps going down while the loss on the validation data increases, you know you're overfitting. I believe your setting is different though, and you don't have fixed training data? Maybe this information gives you some useful ideas anyway. – Dennis Soemers Sep 09 '16 at 07:29
  • @Koanashi For a more useful reply, I'd first have to read the paper in detail to get a better idea of exactly how it works in this setting. I don't mind doing that since it's interesting and I probably should read it sometime anyway, but it's going to take a bit of time. – Dennis Soemers Sep 09 '16 at 07:30
  • @Koanashi I've read the paper in a bit more detail now. Is it correct that your ''training data'' (set of experiences) grows over time? Because I believe you play a bit, then train a bit, then play a bit, then train a bit again, etc.? In that case, I assume that initially the training set is relatively small, and therefore it is easier to obtain a low loss. When the set grows, it becomes more difficult for the network to still match the observations as well, so I don't think it's strange necessarily if the loss grows a bit. (continuing in next comment due to char limit) – Dennis Soemers Sep 09 '16 at 09:11
  • Your loss should eventually stabilize though. From the last image you linked, it looks like it might start stabilizing around 0.03. It's difficult to tell though; it would need more epochs to be sure. I think the most important question is: does your agent still play well? If so, there's no problem with the loss increasing a bit, as long as it doesn't keep going up forever. – Dennis Soemers Sep 09 '16 at 09:12
  • Yes, you store all transitions ("experiences") in a replay memory with a size of 1,000,000. Out of this replay memory, you take a random minibatch of 32 to train your network (see the sketch below). I let the program run with one hidden layer with 30 nodes and the loss converged to 0.03 again. The agent approaches the ball but is not precise enough for most of them. So I started the simulation again, in the hope that it will get more precise. I can see the results tomorrow. I also recommend this [paper](http://arxiv.org/pdf/1312.5602v1.pdf), which was published before the one in Nature. – chron0x Sep 12 '16 at 13:32
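
For reference, a rough sketch of the experience replay mechanism described in the comment above (the capacity and batch size come from the comment; all names are illustrative):

```python
import random
from collections import deque

REPLAY_CAPACITY = 1000000  # size of the replay memory (from the comment)
BATCH_SIZE = 32            # minibatch size used for each training step

replay_memory = deque(maxlen=REPLAY_CAPACITY)

def store_transition(state, action, reward, next_state, terminal):
    # every transition ("experience") is stored; the oldest are dropped when full
    replay_memory.append((state, action, reward, next_state, terminal))

def sample_minibatch():
    # uniformly sample 32 stored transitions to train the Q-network on
    return random.sample(replay_memory, min(BATCH_SIZE, len(replay_memory)))
```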