
I'm trying to come up with a better representation for the state of a 2-d grid world for a Q-learning algorithm which utilizes a neural network for the Q-function.

In the tutorial, Q-learning with Neural Networks, the grid is represented as a 3-d array of integers (0 or 1). The first and second dimensions represent the position of an object in the grid world. The third dimension encodes which object it is.

So, for a 4x4 grid with 4 objects in it, you would represent the state with a 3-d array with 64 elements in it (4x4x4). This means that the neural network would have 64 nodes in the input layer so it could accept the state of the grid world as input.
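For reference, this is roughly what that encoding looks like. It's just an illustrative Python sketch (my actual implementation uses Encog .NET, and the object indices here are arbitrary):

```python
import numpy as np

GRID = 4          # the world is GRID x GRID
N_OBJECTS = 4     # e.g. player, wall, pit, goal (illustrative labels)

def encode_onehot(positions):
    """positions: dict mapping object index -> (row, col).
    Returns a 4x4x4 one-hot tensor, flattened to a 64-element input vector."""
    state = np.zeros((GRID, GRID, N_OBJECTS), dtype=np.float32)
    for obj_idx, (r, c) in positions.items():
        state[r, c, obj_idx] = 1.0
    return state.reshape(-1)   # 64 inputs for the network

# example: player (index 0) at (0,0), goal (index 3) at (3,3)
x = encode_onehot({0: (0, 0), 3: (3, 3)})
print(x.shape)  # (64,)
```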

I want to reduce the number of nodes in the neural network so that training does not take as long. So, can you represent the grid world as a 2-d array of doubles instead?

I tried to represent a 4x4 grid world as a 2-d array of doubles and used different values to represent different objects. For example, I used 0.1 to represent the player and 0.4 to represent the goal. However, when I implemented this, the algorithm stopped learning altogether.
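Roughly, the alternative encoding I tried looks like this (again just a Python sketch; the 0.2 and 0.3 codes for the other two objects are placeholders, only 0.1 and 0.4 are the values I actually mentioned):

```python
import numpy as np

GRID = 4
# scalar codes per object type; only "player" and "goal" values are the real ones
CODES = {"player": 0.1, "wall": 0.2, "pit": 0.3, "goal": 0.4}

def encode_scalar(positions):
    """positions: dict mapping object name -> (row, col).
    Returns a 4x4 grid of doubles, flattened to a 16-element input vector."""
    state = np.zeros((GRID, GRID), dtype=np.float32)
    for name, (r, c) in positions.items():
        state[r, c] = CODES[name]
    return state.reshape(-1)   # only 16 inputs instead of 64

x = encode_scalar({"player": (0, 0), "goal": (3, 3)})
print(x.shape)  # (16,)
```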

Right now I think my problem might be that I need to change which activation functions I'm using in my layers. I'm presently using the hyperbolic tangent activation function. My input values range from 0 to 1 and my output values range from -1 to 1. I've also tried the sigmoid function.

I realize this is a complex problem to be asking a question about. Any suggestions about the architecture of the network would be appreciated.

UPDATE

There are three variants of the game:

1. The world is static. All objects start in the same place.
2. The player's starting position is random. All other objects stay the same.
3. Each grid is totally random.

With more testing I discovered that I can complete the first two variants with my 2-d array representation, so I think my network architecture might be fine. What I discovered is that my network is now extraordinarily susceptible to catastrophic forgetting (much more so than when I was using the 3-d array). I have to use "experience replay" to make it learn, but even then I still can't complete the third variant. I'll keep trying. I'm rather shocked by how much of a difference changing the grid world representation made; it hasn't improved performance at all.
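For context, my experience replay is essentially a fixed-size buffer of transitions that I sample minibatches from. A minimal Python sketch of the idea (the capacity and batch size here are placeholders, not my actual settings):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform random sampling breaks the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```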

  • Did you resolve this? I am doing tic-tac-toe with a 3x3 grid. Instead of having a 2x3x3 state tensor to model X and O, I have also tried using a single 3x3 state tensor, with 0 representing 'X' and 1 representing 'O', but the system is not learning at all. – Rowan Gontier Jan 23 '22 at 23:50

1 Answer


Some standard representations are:

  • Polynomial (usually 1st or 2nd degree): for the 1st degree you will have a 3-dimensional vector, where the first element is the bias (degree 0), the second is the x coordinate and the third is the y coordinate. For higher degrees you will also have x^2, y^2, xy, and so on. If the environment changes, you also have to do the same with the objects' positions.

  • Radial basis functions (or tile coding, since the state space is discrete): you'll have an N x N vector (N is the size of the environment) and each basis / tile will tell you whether the agent is in the corresponding cell. You can also have fewer bases / tiles, each one covering more than one cell. Then you can append a polynomial for the objects in the environment (if their location changes). A rough sketch of both feature types follows this list.
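To make the idea concrete, here is a minimal Python sketch of both feature constructions (the sizes and the concatenation at the end are just illustrative):

```python
import numpy as np

def poly_features(x, y, degree=2):
    """2nd-degree polynomial features of a position: [1, x, y, x^2, y^2, xy]."""
    feats = [1.0, x, y]
    if degree >= 2:
        feats += [x * x, y * y, x * y]
    return np.array(feats, dtype=np.float32)

def tile_features(x, y, n=4):
    """One tile per cell of an n x n grid: a length n*n indicator vector."""
    feats = np.zeros(n * n, dtype=np.float32)
    feats[x * n + y] = 1.0
    return feats

# the two can be concatenated, e.g. tiles for the agent position plus a
# polynomial for a moving object's coordinates
state = np.concatenate([tile_features(1, 2), poly_features(3, 3)])
```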

Anyway, a 64-dimensional input should not be a problem for an NN. I am not sure that tanh is the best non-linearity to use. If you read the famous DeepMind paper you'll see that they use a rectified linear activation (why? read this).

Also, be sure to use a gradient descent optimizer during backpropagation.
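For instance, a small Q-network along these lines could look like the following (a Keras sketch with illustrative layer sizes; the question's 64-input / 4-action setup is assumed):

```python
from tensorflow import keras
from tensorflow.keras import layers

# illustrative sizes: 64 inputs (4x4x4 one-hot state), 4 actions (up/down/left/right)
q_network = keras.Sequential([
    layers.Dense(164, activation="relu", input_shape=(64,)),
    layers.Dense(150, activation="relu"),
    layers.Dense(4, activation="linear"),   # raw Q-values, no squashing
])

# RMSprop (or Adam) is a gradient descent optimizer; MSE loss on the TD target
q_network.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-3), loss="mse")
```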

EDIT

There is basically no difference between the 1st and the 2nd version (actually, having a random agent initial position might even speed up the learning). The third version is of course more difficult, as you have to include details about the environment in your state representation.

Anyway, the features I suggest are still the same: polynomial or radial basis functions.

Experience replay is almost mandatory, as described in the DeepMind paper I cite above. Also, you might find it beneficial to use a second deep network as the target for the Q-function. I don't think this is suggested in the tutorial (I might have missed it). Basically, the target r + gamma * max_a Q(s', a) is given by a different network than the Q-network used by your policy. Every C steps you copy the parameters of your Q-network to your Q-target-network in order to give consistent targets during temporal difference backups. These two tricks (experience replay with minibatches and having a separate target network) are what made Deep Q-learning successful. Again, refer to the DeepMind paper for details.
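As a rough sketch of how the target network fits into a training step (assuming a Keras-style network like the one above; the copy period C and the discount factor are placeholders):

```python
import numpy as np

GAMMA = 0.99
C = 1000  # copy period, illustrative

# target_network starts as a copy of q_network (e.g. via keras.models.clone_model)
def train_step(q_network, target_network, batch, step):
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))

    # targets come from the *separate* target network: r + gamma * max_a Q_target(s', a)
    next_q = target_network.predict(next_states, verbose=0)
    targets = q_network.predict(states, verbose=0)
    for i, a in enumerate(actions):
        targets[i, a] = rewards[i] + (0 if dones[i] else GAMMA * next_q[i].max())

    q_network.train_on_batch(states, targets)

    # every C steps, copy the online weights into the target network
    if step % C == 0:
        target_network.set_weights(q_network.get_weights())
```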

Finally, some crucial aspects you might want to check (a rough sketch of an exploration schedule follows the list):

  • how big are your minibatches?
  • how explorative is your policy?
  • how many samples with a random policy do you collect before starting to learn?
  • how long are you training? (it can easily take 500k samples in total to learn)
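Regarding the last three points, a simple epsilon-greedy schedule with a random warm-up phase could look like this (all constants are illustrative, not recommendations for this specific problem):

```python
import random

EPS_START, EPS_END = 1.0, 0.1
WARMUP_STEPS = 10_000     # collect with a fully random policy first (illustrative)
DECAY_STEPS = 100_000     # then anneal epsilon linearly (illustrative)

def epsilon(step):
    if step < WARMUP_STEPS:
        return EPS_START                      # pure exploration, no learning yet
    frac = min(1.0, (step - WARMUP_STEPS) / DECAY_STEPS)
    return EPS_START + frac * (EPS_END - EPS_START)

def select_action(q_values, step, n_actions=4):
    """q_values: numpy array of Q-values for the current state."""
    if random.random() < epsilon(step):
        return random.randrange(n_actions)    # explore
    return int(q_values.argmax())             # exploit
```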
  • Yes, the environment does change. I updated my question. – Galen Apr 27 '16 at 14:19
  • @Galen I have edited my answer as well. I stress this again: try using `ReLU` instead of `tanh`. And did you use a gradient descent optimizer? Try Adam or RMSprop. – Simon Apr 27 '16 at 16:56
  • I'm currently testing using a mini-batch size of 500 randomly taken from a pool of 2000 unique experiences. As for how explorative it is, I start epsilon at 1.0 and slowly decay it to 0.1. I usually wait until around 30K episodes. If there is no improvement by then I stop it. It would take days for it to get to 500k (not using a GPU). – Galen Apr 27 '16 at 20:06
  • @Galen 2000 is very small and 500 very big. In the Atari Games paper they use minibatches of size 32 and a dataset of 1M unique samples. And they learn much more complex problems than a gridworld :D The problem with your setting is that you use 1/4 of your samples at each step, so you use the same data too often. I would increase the dataset size to 100k and use minibatches of 32 samples (or ~100 at most). The other parameters seem fine. – Simon Apr 27 '16 at 20:06
  • @Galen And how long do you wait before learning? You should first collect some samples with a random policy (keep epsilon at 1 without any decay) and then start learning (and decaying epsilon). I had a tremendous improvement when I increased this "idle time" while playing around with my hyperparameters. Oh, and when I said "500k samples" I meant steps, not episodes. 30k episodes (I assume of more or less 100 steps each) are fine. – Simon Apr 27 '16 at 20:13
  • Thanks, obviously I need to go over the docs you linked. It will take me a while to do that and try out some of your suggestions. AFAIK, Encog .Net does not support ReLU. Also, I'm not sure if Encog uses gradient descent optimization. I'll have to read more to understand it. I changed the sampling/mini-batches per your suggestions. I will know in the morning if it helps. Looks promising so far. – Galen Apr 27 '16 at 21:19