
I'm experimenting with deep Q-learning using Keras, and I want to teach an agent to perform a task.

In my problem I want to teach an agent to avoid hitting objects in its path by changing its speed (accelerating or decelerating).

The agent moves horizontally and the objects to avoid move vertically, and I want it to learn to change its speed so that it avoids hitting them. I based my code on this: Keras-FlappyBird

I tried 3 different models (I'm not using a convolutional network):

  1. a model with 10 dense hidden layers and the sigmoid activation function, with 400 output nodes

  2. a model with 10 dense hidden layers and the Leaky ReLU activation function

  3. a model with 10 dense hidden layers and the ReLU activation function, with 400 output nodes

I feed the coordinates and speeds of all the objects in my world to the network.
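To give an idea of the input, a simplified sketch of how I build the state vector looks roughly like this (the names and object attributes are illustrative, not my exact code):

```python
import numpy as np

# Illustrative only: flatten the agent's speed plus every object's
# coordinates and speed into one input vector for the network.
def build_state(agent, objects):
    features = [agent.x, agent.speed]
    for obj in objects:
        features.extend([obj.x, obj.y, obj.speed])
    # shape (1, state_size) so it can be fed directly to model.predict()
    return np.array(features, dtype=np.float32).reshape(1, -1)
```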

I trained it for 1 million frames but still can't see any result. Here are my Q-value plots for the 3 models:

Model 1: Q-value plot
Model 2: Q-value plot
Model 3: Q-value plot
Model 3: Q-value plot (zoomed)

As you can see, the Q-values aren't improving at all, and the same goes for the reward. Please help me: what am I doing wrong?

un famous

1 Answer


I am a little confused by your environment. I am assuming that your problem is not Flappy Bird, and that you are porting the Flappy Bird code over to your own environment. So even though I don't know your environment or your code, I still think there is enough here to point out some potential issues and get you on the right track.

First, you mention the three models that you have tried. Picking the right function approximator is of course very important in deep reinforcement learning, but there are many more hyper-parameters that can matter for solving your problem: gamma, the learning rate, the exploration rate and its decay, the replay memory length, the training batch size, and so on. The fact that your Q-values are not changing in states where you believe they should leads me to believe that too little exploration is being done for models one and two. In the code example, epsilon starts at 0.1; try larger values, up to 1, which will also require adjusting the decay rate of the exploration rate. If your Q-values were shooting up drastically across episodes, I would also look at the learning rate (although in the code sample it looks pretty small). On the same note, gamma can be extremely important: if it is too small, your learner will be myopic.
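To make that concrete, here is a rough sketch of the kind of epsilon-greedy schedule I mean (the values are only starting points and need tuning for your problem, and `model` is assumed to be your Keras Q-network):

```python
import random
import numpy as np

EPSILON_START = 1.0    # start by exploring a lot, instead of 0.1
EPSILON_MIN = 0.05     # never stop exploring completely
EPSILON_DECAY = 0.999  # multiplicative decay applied every frame/episode
GAMMA = 0.99           # discount factor; too small and the learner is myopic

epsilon = EPSILON_START

def choose_action(model, state, num_actions):
    """Epsilon-greedy: random action with probability epsilon, else greedy."""
    global epsilon
    if random.random() < epsilon:
        action = random.randrange(num_actions)
    else:
        q_values = model.predict(state)[0]   # one Q-value per action
        action = int(np.argmax(q_values))
    epsilon = max(EPSILON_MIN, epsilon * EPSILON_DECAY)
    return action
```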

You also mention that you have 400 output nodes. Does your environment really have 400 actions? Large action spaces come with their own set of challenges; here is a good paper to look at if you do indeed have 400 actions: https://arxiv.org/pdf/1512.07679.pdf. If you do not have 400 actions, something is wrong with your network structure: you should have one output node per action, each giving the estimated Q-value of that action, and then select the action with the highest value. For example, the code example you posted has two actions and uses ReLU in the hidden layers.
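For illustration, a Q-network head for a small discrete action set would typically look like this (a sketch assuming a Keras Sequential model; the hidden sizes and input dimension are placeholders, not your setup):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

NUM_ACTIONS = 2   # e.g. the two actions in the Flappy Bird example
STATE_SIZE = 8    # placeholder input dimension

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=STATE_SIZE))
model.add(Dense(64, activation='relu'))
# One output node per action, each estimating that action's Q-value.
model.add(Dense(NUM_ACTIONS, activation='linear'))
model.compile(loss='mse', optimizer='adam')

# Greedy selection: pick the action with the largest predicted Q-value.
# state = np.zeros((1, STATE_SIZE))           # placeholder state
# action = int(np.argmax(model.predict(state)[0]))
```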

Getting the parameters of deep Q-learning right is very difficult, especially when you account for how slow training is.

Derek_M
  • Thank you so much for your answer. 1: by 400 nodes I mean the hidden nodes; in the output layer I only have 3. 2: yes, you are correct, I'm porting the code to my own environment. 3: to explain my environment: I'm basically trying to train a network to avoid collisions with moving targets; I feed the positions of the targets as input, and the output is 3 possible actions. – un famous May 10 '17 at 17:35
  • And I think you are correct: I tried changing the epsilon and gamma parameters and I see some improvement, but still not the result I hoped for. – un famous May 10 '17 at 17:37
  • I would also look at the other parameters as well, including the learning rate. If your environment doesn't have a terminal state, you may need to keep a minimum exploration rate of 0.1 or something similar, so that it keeps exploring new states while converging on a reasonably good policy. – Derek_M May 10 '17 at 17:53
  • 1
    I have had experiments that took me nearly 2 weeks to find the optimal parameters. DQN tuning can be extremely painful when experimenting with an especially large MDP (or infinite/ partially observable). – Derek_M May 10 '17 at 17:54
  • What are the three actions in your case? My understanding is that there can only be two actions, up and down. – Bunny Rabbit Mar 28 '18 at 09:04