I am trying to implement Q-learning with a neural network. I already have Q-learning with a Q-table working perfectly fine.
I am playing a little "catch the cheese" game.
It looks something like this:
# # # # # # # #
# . . . . . . #
# . $ . . . . #
# . . . P . . #
# . . . . . . #
# . . . . . . #
# . . . . . . #
# # # # # # # #
The player P spawns somewhere on the map. If it hits a wall, the reward is negative; let's call that reward -R for now.
If the player P hits the dollar sign, the reward is positive: +R.
In both cases the game resets and the player spawns somewhere randomly on the map.
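Simplified, the game logic looks roughly like this (a minimal Python sketch of my description above; the class and variable names are just for illustration):

```python
import random

R = 1.0       # reward magnitude (the +R / -R from above)
SIZE = 8      # grid size, including the wall border

class CatchTheCheese:
    def __init__(self):
        self.reset()

    def reset(self):
        self.cheese = (2, 2)                  # fixed cheese position, as in the map
        self.player = self.cheese
        while self.player == self.cheese:     # spawn the player on a random free cell
            self.player = (random.randint(1, SIZE - 2), random.randint(1, SIZE - 2))
        return self.state()

    def state(self):
        # 8x8 grid: 0 everywhere, 1 at the player's position
        s = [[0.0] * SIZE for _ in range(SIZE)]
        s[self.player[0]][self.player[1]] = 1.0
        return s

    def step(self, action):
        # actions: 0 = right, 1 = bottom, 2 = left, 3 = top
        dy, dx = [(0, 1), (1, 0), (0, -1), (-1, 0)][action]
        y, x = self.player[0] + dy, self.player[1] + dx
        if y in (0, SIZE - 1) or x in (0, SIZE - 1):
            return self.state(), -R, True     # hit a wall: -R, game resets
        self.player = (y, x)
        if self.player == self.cheese:
            return self.state(), +R, True     # caught the cheese: +R, game resets
        return self.state(), 0.0, False
```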
My neural network architecture looks like this:
-> Input size: [1, 8, 8]
Flatten: [1, 1, 64] (so I can use dense layers)
Dense layer: [1, 1, 4]
-> Output size: [1, 1, 4]
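Written in Keras for clarity, the same architecture would look like this (a sketch; the loss and optimizer here are placeholder choices, not part of my actual setup):

```python
from tensorflow import keras
from tensorflow.keras import layers

# 8x8 state in -> flatten to 64 -> one Q-value per action out.
model = keras.Sequential([
    layers.Input(shape=(8, 8)),
    layers.Flatten(),        # [1, 8, 8] -> 64 inputs for the dense layer
    layers.Dense(4),         # linear output: Q-values for the 4 actions
])
model.compile(optimizer="adam", loss="mse")   # placeholder choices
```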
For the learning, I am storing game samples (state, action, reward, nextState) in a buffer. The buffer's maximum size is b_max.
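Concretely, the buffer is just a bounded FIFO over those samples (sketch; the value of b_max is illustrative):

```python
import random
from collections import deque

b_max = 10_000                    # buffer capacity (illustrative value)
buffer = deque(maxlen=b_max)      # oldest samples drop out automatically

def remember(state, action, reward, next_state):
    buffer.append((state, action, reward, next_state))

def sample_batch(batch_size):
    return random.sample(buffer, min(batch_size, len(buffer)))
```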
So my training looks like this (a sketch of the full loop follows after the list):
1. Pick a random number between 0 and 1.
2. If the number is greater than the threshold, choose a random action.
3. Otherwise pick the action with the highest predicted Q-value.
4. Take that action, observe the reward, and store the sample in the buffer.
5. Update the neural network on a batch of game samples drawn from the buffer:
   5.1 Iterate through the batch and train the network as follows.
   5.2 For each sample, the input to the network is the game state (0 everywhere, except at the player's position).
   5.3 The output error of the output layer is 0 everywhere, except at the output neuron corresponding to the action that was taken in that sample.
   5.4 There, the expected output is:
       (the reward) + (discount_factor * future_reward), with future_reward = max(neuralNetwork(nextState)).
6. Repeat everything from the beginning.
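Putting the pieces above together, the whole loop I described is roughly this (Python sketch building on the snippets above; threshold, gamma, batch_size and the number of steps are illustrative values):

```python
import random
import numpy as np

threshold = 0.9      # explore with probability 1 - threshold (illustrative)
gamma = 0.9          # discount_factor (illustrative)
batch_size = 32

env = CatchTheCheese()
state = env.reset()

for step in range(100_000):                       # one environment step per iteration
    # Steps 1-3: epsilon-greedy action selection
    if random.random() > threshold:               # above the threshold: random action
        action = random.randrange(4)
    else:                                         # otherwise: highest predicted Q-value
        q = model.predict(np.array([state]), verbose=0)[0]
        action = int(np.argmax(q))

    # Step 4: act, observe the reward, store the sample
    next_state, reward, done = env.step(action)
    remember(state, action, reward, next_state)
    state = env.reset() if done else next_state

    # Step 5: train on a random batch from the buffer
    if len(buffer) >= batch_size:
        batch = sample_batch(batch_size)
        states = np.array([s for s, a, r, ns in batch])
        next_states = np.array([ns for s, a, r, ns in batch])
        targets = model.predict(states, verbose=0)        # 5.3: keep current outputs,
        future_q = model.predict(next_states, verbose=0)  # so the error is 0 everywhere...
        for i, (s, a, r, ns) in enumerate(batch):
            # ...except at the taken action, where (5.4) the expected output is
            # reward + discount_factor * max Q(nextState)
            targets[i][a] = r + gamma * np.max(future_q[i])
        model.fit(states, targets, epochs=1, verbose=0)
```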
The thing is that it just doesn't seem to work properly. I have an idea for how I could change this so it works, but I am not sure if this is "allowed":
Each game decision could be trained on until the network does exactly what it is supposed to do; then I would move on to the next decision, train on that, and so on. How is the training usually done?
I would be very happy if someone could give me a detailed explanation of how the training works, especially when it comes to "how many times do I run which loop?".
Greetings, Finn
This is a map that shows which action the neural network would pick on each field (a sketch of how it is generated follows after the map):
# # # # # # # # # #
# 1 3 2 0 2 3 3 3 #
# 1 1 1 1 0 2 2 3 #
# 0 0 $ 1 3 0 1 1 #
# 1 0 1 2 1 0 3 3 #
# 0 1 2 3 1 0 3 0 # // this map is a little bigger, but you can still see that it is wrong
# 2 0 1 3 1 0 3 0 # // 0: right, 1: bottom, 2: left, 3: top
# 1 0 1 0 2 3 2 1 #
# 0 3 1 3 1 3 1 0 #
# # # # # # # # # #
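I generate that map by feeding the network a one-hot state for every free cell and printing the argmax of its output, roughly like this (sketch, reusing the model and numpy import from above):

```python
def print_policy(model, size=8):
    # Print the greedy action (argmax of the Q-values) for every free cell;
    # '#' marks the wall border.
    for y in range(size):
        row = []
        for x in range(size):
            if y in (0, size - 1) or x in (0, size - 1):
                row.append("#")
            else:
                s = np.zeros((1, size, size))
                s[0, y, x] = 1.0                    # player at (y, x)
                q = model.predict(s, verbose=0)[0]
                row.append(str(int(np.argmax(q))))
        print(" ".join(row))
```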