
I am trying to implement Q-learning with a neural network. I already have tabular Q-learning (with a Q-table) working perfectly fine.

I am playing a little "catch the cheese" game.

It looks something like this:

# # # # # # # #
# . . . . . . #
# . $ . . . . #
# . . . P . . #
# . . . . . . #
# . . . . . . #
# . . . . . . # 
# # # # # # # #

The player P spawns somewhere on the map. If he hits a wall, the reward is negative. Let's call that negative reward -R for now.

If the player P reaches the dollar sign, the reward is positive: +R.

In both cases, the game resets and the player spawns at a random position on the map.
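For reference, the rules above could be sketched as a minimal environment like this (the class and method names, the action-to-move mapping, and R = 1 are my own assumptions, not taken from the question):

```python
import random

SIZE = 6            # playable area inside the walls
R = 1.0             # reward magnitude (the question's R)
# action -> (dy, dx): 0 right, 1 down, 2 left, 3 up
MOVES = {0: (0, 1), 1: (1, 0), 2: (0, -1), 3: (-1, 0)}

class CatchTheCheese:
    def __init__(self):
        self.cheese = (1, 1)    # fixed cheese ($) position
        self.reset()

    def reset(self):
        # spawn the player on a random free cell
        while True:
            self.player = (random.randrange(SIZE), random.randrange(SIZE))
            if self.player != self.cheese:
                return self.player

    def step(self, action):
        dy, dx = MOVES[action]
        y, x = self.player[0] + dy, self.player[1] + dx
        if not (0 <= y < SIZE and 0 <= x < SIZE):   # hit a wall
            self.reset()
            return -R, True
        self.player = (y, x)
        if self.player == self.cheese:               # got the cheese
            self.reset()
            return +R, True
        return 0.0, False
```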

My neural network architecture looks like this:

-> Inputsize:   [1, 8, 8]
   Flattening:  [1, 1, 64] (So I can use Dense layers)
   Dense Layer: [1, 1, 4]
-> Outputsize:  [1, 1, 4]
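In NumPy terms, that forward pass looks like this (the weight initialization and the linear output activation are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

state = np.zeros((8, 8))          # input: the 8x8 board
state[3, 4] = 1.0                 # player position marked with a 1

x = state.reshape(-1)             # flatten: 64 inputs for the dense layer
W = rng.normal(0, 0.1, (64, 4))   # dense layer weights
b = np.zeros(4)
q = x @ W + b                     # output: 4 Q-values, one per action
```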

For the learning, I am storing some game samples in a buffer. The buffer maximum size is b_max.
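Such a buffer can be a simple bounded deque (the value of b_max and the sample layout here are placeholders, not from the question):

```python
import random
from collections import deque

B_MAX = 10_000                 # b_max from the question (value assumed)
buffer = deque(maxlen=B_MAX)   # oldest samples fall out automatically

def store(state, action, reward, next_state, done):
    buffer.append((state, action, reward, next_state, done))

def sample_batch(batch_size):
    return random.sample(buffer, min(batch_size, len(buffer)))
```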

So my training looks like this:

  1. Pick a random number between 0 and 1.
  2. If the number is greater than the threshold, choose a random action.
  3. Otherwise pick the action with the highest predicted Q-value.
  4. Take that action and observe the reward.
  5. Update the neural network using a batch of game samples from the buffer:

    5.1 Iterate through the batch and train the network as follows:

    5.2 For each sample in the batch, the input to the network is the game state (0 everywhere, except a 1 at the player's position).

    5.3 The output error of the output layer is 0 everywhere except at the output neuron corresponding to the action taken in that sample.

    5.4 There, the expected output is: (the reward) + (discount_factor * future_reward), where future_reward = max(neuralNetwork(nextState)).

    5.5 Repeat everything from the beginning.

The thing is that it just doesn't seem to work properly. I have an idea for how I could change this, but I am not sure if it is "allowed":

I could train on each game decision until the network does exactly what it is supposed to do, then move on to the next decision and train on that, and so on. How is the training usually done?

I would be very happy if someone could help and give me a detailed explanation of how the training works, especially when it comes to "how many times do I run which loop?".

Greetings, Finn

This map shows which action the neural network would choose on each field:

# # # # # # # # # # 
# 1 3 2 0 2 3 3 3 # 
# 1 1 1 1 0 2 2 3 # 
# 0 0 $ 1 3 0 1 1 # 
# 1 0 1 2 1 0 3 3 # 
# 0 1 2 3 1 0 3 0 #  //The map is a little bit bigger but still it can be seen that it is wrong
# 2 0 1 3 1 0 3 0 #  //0: right, 1 bottom, 2 left, 3 top
# 1 0 1 0 2 3 2 1 # 
# 0 3 1 3 1 3 1 0 # 
# # # # # # # # # # 