Okay, so I have created a neural network Q-learner using the same idea as DeepMind's Atari algorithm (except I give it the raw board state, not pictures (yet)).
Neural network build (a rough sketch of the forward pass is included after this list):
- 9 inputs (0 for an empty spot, 1 for "X", -1 for "O")
- 1 hidden layer with 9-50 neurons (I have tried different sizes; sigmoid activation)
- 9 outputs (one per action, each giving that action's Q-value; sigmoid activation)
- MSE loss function
- Adam optimizer
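For reference, here is roughly what that forward pass looks like. This is only a simplified sketch with plain arrays and placeholder names (HIDDEN, w1, b1, w2, b2), not the actual code from the repo, which uses the jblas DoubleMatrix classes:

```java
public class ForwardPassSketch {
    static final int IN = 9, HIDDEN = 27, OUT = 9; // hidden size is anywhere in the 9-50 range

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Returns the 9 Q-value outputs for one board state (0 = empty, 1 = "X", -1 = "O").
    static double[] forward(double[] state, double[][] w1, double[] b1,
                            double[][] w2, double[] b2) {
        // Hidden layer: weighted sum of the 9 inputs, then sigmoid.
        double[] hidden = new double[HIDDEN];
        for (int j = 0; j < HIDDEN; j++) {
            double sum = b1[j];
            for (int i = 0; i < IN; i++) sum += w1[j][i] * state[i];
            hidden[j] = sigmoid(sum);
        }
        // Output layer: one Q-value per action, also passed through sigmoid.
        double[] q = new double[OUT];
        for (int k = 0; k < OUT; k++) {
            double sum = b2[k];
            for (int j = 0; j < HIDDEN; j++) sum += w2[k][j] * hidden[j];
            q[k] = sigmoid(sum); // sigmoid keeps each output in (0, 1)
        }
        return q;
    }
}
```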
I'm 100% confident the network is implemented correctly because of gradient checks and lots of tests.
Q-parameters:
- -1 reward for a lost game
- -1 reward if a move is attempted on an already occupied spot (e.g. X is already on the square where player O tries to put his "O")
- 0 reward for draws
- 0 reward for moves that don't lead to a terminal state
- +1 reward for a won game
- The next state (in s,a,r,s') is the state after your own and your opponent's move. E.g. the board is empty and player X has the first turn and puts "X" in the upper left corner; then player O puts "O" in the upper right corner. Then s,a,r,s' would be s = [0,0,0,0,0,0,0,0,0], a = 0, r = 0, s' = [1,0,-1,0,0,0,0,0,0] (see the target-computation sketch after this list)
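To make the reward scheme and the (s,a,r,s') handling concrete, this is roughly how the training target for one transition is built. It's a simplified sketch with placeholder names (GAMMA, QFunction, qTarget), not the actual code from the repo:

```java
interface QFunction {
    double[] predict(double[] state); // the 9 Q-value outputs of the network
}

class QTargetSketch {
    static final double GAMMA = 0.9; // example discount factor

    // Builds the 9-element training target for one (s, a, r, s') transition.
    static double[] qTarget(double[] s, int a, double r, double[] sPrime,
                            boolean terminal, QFunction net) {
        // Start from the current prediction so only the played action's Q-value changes.
        double[] target = net.predict(s).clone();
        if (terminal) {
            // -1 lost game, -1 move to an occupied spot, 0 draw, +1 won game
            target[a] = r;
        } else {
            // Non-terminal moves have r = 0; bootstrap from the best Q-value of s'.
            double maxNext = Double.NEGATIVE_INFINITY;
            for (double q : net.predict(sPrime)) maxNext = Math.max(maxNext, q);
            target[a] = r + GAMMA * maxNext;
        }
        return target;
    }
}
```

The network is then trained with MSE between its prediction for s and this target vector.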
Problem
All my Q-values collapse to zero if I give a -1 reward when a move is made to an already occupied spot. If I don't do that, the network doesn't learn that it shouldn't play on already occupied squares and seems to learn arbitrary Q-values. Also, my error doesn't seem to shrink.
Solutions that didn't work
- I have tried changing the rewards to (0, 0.5, 1) and (0, 1), but it still didn't learn.
- I have tried representing the state as 0 for empty, 0.5 for O and 1 for X, but it didn't work.
- I have tried using the state straight after the move is made as the next state, but it didn't help.
- I have tried both Adam and vanilla backprop, but the results are the same.
- I have tried both minibatches from replay memory and plain stochastic gradient descent (see the sketch after this list), but still the same.
- I changed sigmoid to ReLU, but it didn't help.
- All kinds of other things I can't recall right now.
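For clarity, this is roughly what I mean by training on batches from replay memory; the class and parameter names (capacity, batch size) are just placeholders, not the actual code from the repo:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Random;

class ReplayMemorySketch {
    // One stored (s, a, r, s', terminal) transition.
    static class Transition {
        final double[] s, sPrime;
        final int a;
        final double r;
        final boolean terminal;
        Transition(double[] s, int a, double r, double[] sPrime, boolean terminal) {
            this.s = s; this.a = a; this.r = r; this.sPrime = sPrime; this.terminal = terminal;
        }
    }

    private final Deque<Transition> memory = new ArrayDeque<>();
    private final Random rng = new Random();
    private final int capacity = 10_000; // placeholder capacity

    // Store a transition, dropping the oldest one when the memory is full.
    void store(Transition t) {
        if (memory.size() >= capacity) memory.removeFirst();
        memory.addLast(t);
    }

    // Sample a random minibatch (with replacement) to train on.
    List<Transition> sampleBatch(int batchSize) {
        List<Transition> all = new ArrayList<>(memory);
        List<Transition> batch = new ArrayList<>();
        for (int i = 0; i < batchSize && !all.isEmpty(); i++) {
            batch.add(all.get(rng.nextInt(all.size())));
        }
        return batch;
    }
}
```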
Project on GitHub: https://github.com/Dopet/tic-tac-toe (sorry for the ugly code, mostly due to all of this refactoring; also, this was supposed to be an easy test to see if the algorithm works)
Main points:
- The TicTac class contains the game itself (implemented using the template method pattern from an abstract Game class)
- The NeuralNetwork class logs some data to a file called MyLogFile.log in the current directory
- The Block and Combo classes are just used to create the winning combinations
- jblas-1.2.4.jar contains the DoubleMatrix library