I have made a simple Tron game in C++ and an MLP with one hidden layer. I have implemented Q-learning on top of this network, but the agent does not win more games over time (even after 1 million games). I will describe what I did below; hopefully someone can spot a mistake that might be causing this problem.
In every state there are four possible moves (north, east, south, west), and rewards are only given at the end of the game (-1 for a loss, 0 for a draw, 1 for a win).
I initialise 4 MLPs, one for each possible action, with 100 input nodes (the entire 10x10 game grid), where each cell is 1 if the player itself has visited it, 0 if it is empty, and -1 if the opponent has visited it. Each network has 50 hidden nodes and 1 output node (I have also tried a single network with 4 output nodes, but that does not help either). The weights are initialised uniformly at random between -0.5 and 0.5.
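For reference, the state encoding and network initialisation look roughly like this (a simplified sketch; the grid representation, class names and sizes of the vectors are placeholders for my actual code):

```cpp
#include <array>
#include <random>
#include <vector>

constexpr int GRID = 10;
constexpr int INPUTS = GRID * GRID;   // 100 input nodes
constexpr int HIDDEN = 50;            // 50 hidden nodes
constexpr int ACTIONS = 4;            // north, east, south, west

// Encode the grid into the 100-element input vector: 1 for my own cells,
// -1 for the opponent's cells, 0 for empty cells. Here I assume the raw
// grid stores 0 = empty, 1 = my trail, 2 = opponent trail (placeholder).
std::array<double, INPUTS> encodeState(const std::array<int, INPUTS>& grid) {
    std::array<double, INPUTS> x{};
    for (int i = 0; i < INPUTS; ++i) {
        if (grid[i] == 1)      x[i] =  1.0;  // visited by me
        else if (grid[i] == 2) x[i] = -1.0;  // visited by opponent
        else                   x[i] =  0.0;  // empty
    }
    return x;
}

// One MLP per action; weights initialised uniformly in [-0.5, 0.5].
struct Mlp {
    std::vector<double> w1;  // INPUTS x HIDDEN weights
    std::vector<double> w2;  // HIDDEN x 1 weights
    explicit Mlp(std::mt19937& rng) : w1(INPUTS * HIDDEN), w2(HIDDEN) {
        std::uniform_real_distribution<double> u(-0.5, 0.5);
        for (auto& w : w1) w = u(rng);
        for (auto& w : w2) w = u(rng);
    }
};
```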
At every epoch I initialise the game environment with the 2 agents placed randomly on the grid, run the game in a while loop until it is over, and then reset the environment. Within this while loop, I do the following (a simplified code sketch follows the list).
- I feed the current state into each MLP, determine the action with the highest Q-value, and take it with 90% probability (10% of the time I move randomly). The Q-value is the network output, using either a sigmoid or ReLU activation function (I have tried both).
- In the new state I then calculate the 4 Q-values and use them to train the network of the action I just took, with the target: target = reward + gamma * maxQ(nextState). The error is then error = target - Q(previousState, chosenAction).
- I propagate this error backwards with backpropagation, using the derivative of the sigmoid function, a high learning rate, and a momentum term.
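Put together, one iteration of the inner loop looks roughly like this (again a simplified sketch that continues the code above; Game, Mlp::forward() and Mlp::backprop() stand in for my actual game and network code, not their real signatures):

```cpp
#include <algorithm>

// One step of the inner while loop, exactly as described in the list above.
void qLearningStep(Game& game, std::array<Mlp, ACTIONS>& nets, std::mt19937& rng) {
    const double gamma = 0.9;    // discount factor
    const double epsilon = 0.1;  // 10% random moves
    std::uniform_real_distribution<double> uniform01(0.0, 1.0);
    std::uniform_int_distribution<int> randomAction(0, ACTIONS - 1);

    // 1. Pick the greedy action from the current state (explore with probability epsilon).
    auto state = encodeState(game.grid());
    int action = 0;
    double qChosen = nets[0].forward(state);
    for (int a = 1; a < ACTIONS; ++a) {
        double q = nets[a].forward(state);
        if (q > qChosen) { qChosen = q; action = a; }
    }
    if (uniform01(rng) < epsilon) {
        action = randomAction(rng);
        qChosen = nets[action].forward(state);
    }

    // 2. Apply the move, then evaluate the four Q-values in the new state.
    game.step(action);
    auto nextState = encodeState(game.grid());
    double maxQNext = nets[0].forward(nextState);
    for (int a = 1; a < ACTIONS; ++a)
        maxQNext = std::max(maxQNext, nets[a].forward(nextState));

    // 3. target = reward + gamma * max_a Q(s', a); the reward is -1/0/+1
    //    only when the game ends, 0 otherwise.
    double reward = game.isOver() ? game.reward() : 0.0;
    double target = reward + gamma * maxQNext;
    double error  = target - qChosen;

    // 4. Backpropagate the error through the network of the chosen action
    //    (sigmoid derivative, high learning rate, momentum term).
    nets[action].backprop(state, error);
}
```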
My Q-values end up either very low (on the order of 0.0001) or very close to 1 (0.999), and when I look at the error term every 10,000 games, it does not seem to be decreasing.
I started from an MLP that could learn the XOR function and am now using it for Q-learning. Maybe some of the underlying assumptions that held in the XOR case do not hold here and cause the problem for Q-learning?
Or maybe it is the sparse input (just 100 values of 0, 1, or -1) that makes it impossible to learn?
Suggestions are really appreciated!