
Okay, so I have created a neural network Q-learner using the same idea as DeepMind's Atari DQN algorithm (except I feed it the raw board state, not pictures (yet)).

Neural network build:

  • 9 inputs (0 for an empty spot, 1 for "X", -1 for "O")

  • 1 hidden layer with 9-50 neurons (I tried different sizes; sigmoid activation)

  • 9 outputs (one per action, each giving that action's Q-value; sigmoid activation)

  • MSE loss function
  • Adam optimizer for backpropagation
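
For concreteness, the forward pass of that layout looks roughly like this (a plain-Java sketch with made-up names and a fixed hidden size, not the actual NeuralNetwork class from my repo):

    // Rough sketch of the 9 -> hidden -> 9 layout described above.
    // Not the repo's NeuralNetwork class; names and the fixed hidden size are made up.
    public class LayoutSketch {
        static final int IN = 9, HIDDEN = 36, OUT = 9;   // hidden size was varied between 9 and 50
        static double[][] w1 = new double[HIDDEN][IN];   // input -> hidden weights
        static double[]   b1 = new double[HIDDEN];
        static double[][] w2 = new double[OUT][HIDDEN];  // hidden -> output weights
        static double[]   b2 = new double[OUT];

        static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

        // board: 9 values, 0 = empty, 1 = X, -1 = O; returns one Q-value per action
        static double[] forward(double[] board) {
            double[] h = new double[HIDDEN];
            for (int j = 0; j < HIDDEN; j++) {
                double z = b1[j];
                for (int i = 0; i < IN; i++) z += w1[j][i] * board[i];
                h[j] = sigmoid(z);                       // sigmoid hidden layer
            }
            double[] q = new double[OUT];
            for (int k = 0; k < OUT; k++) {
                double z = b2[k];
                for (int j = 0; j < HIDDEN; j++) z += w2[k][j] * h[j];
                q[k] = sigmoid(z);                       // sigmoid output layer, as listed above
            }
            return q;
        }

        public static void main(String[] args) {
            // Forward pass on an empty board (weights left at zero just so it runs).
            System.out.println(java.util.Arrays.toString(forward(new double[9])));
        }
    }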

I'm 100% confident the network is built correctly because of gradient checks and lots of tests.
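
By gradient checks I mean comparing the analytic backprop gradient against a central-difference estimate, along these lines (a minimal self-contained illustration for a single sigmoid unit with MSE loss, not the actual test code from my repo):

    // Self-contained gradient check for one sigmoid unit with MSE loss (illustrative only).
    public class GradCheck {
        public static void main(String[] args) {
            double[] x = {1.0, 0.0, -1.0};   // a tiny "board" fragment as input
            double[] w = {0.3, -0.2, 0.5};
            double target = -1.0;
            double eps = 1e-6;

            for (int i = 0; i < w.length; i++) {
                // analytic gradient of 0.5*(sigmoid(w.x) - target)^2 w.r.t. w[i]
                double y = sigmoid(dot(w, x));
                double analytic = (y - target) * y * (1 - y) * x[i];

                // numerical gradient via central differences
                double old = w[i];
                w[i] = old + eps; double lp = loss(w, x, target);
                w[i] = old - eps; double lm = loss(w, x, target);
                w[i] = old;
                double numerical = (lp - lm) / (2 * eps);

                // the two values should agree to many decimal places if backprop is correct
                System.out.printf("w[%d]: analytic=%.8f numerical=%.8f%n", i, analytic, numerical);
            }
        }
        static double dot(double[] a, double[] b) {
            double s = 0; for (int i = 0; i < a.length; i++) s += a[i] * b[i]; return s;
        }
        static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }
        static double loss(double[] w, double[] x, double t) {
            double y = sigmoid(dot(w, x)); return 0.5 * (y - t) * (y - t);
        }
    }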

Q-parameters:

  • -1 reward for a lost game
  • -1 reward if a move is attempted on an already occupied spot (e.g. X is already in the spot where player O tries to put his "O")
  • 0 reward for a draw
  • 0 reward for moves that don't lead to a terminal state
  • +1 reward for a won game
  • The next state (in s,a,r,s') is the state after both your own and your opponent's move. E.g. the board is empty, player X has the first turn and puts "X" in the upper left corner, and then player O puts "O" in the upper right corner. Then s,a,r,s' would be s = [0,0,0,0,0,0,0,0,0], a = 0, r = 0, s' = [1,0,-1,0,0,0,0,0,0]
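
Written out as code, that example transition and the Q-learning target I compute from it look roughly like this (an illustrative sketch; the variable names and the pretend Q-values are made up, not taken from my repo):

    // The example transition above written out, plus the Q-learning target for it.
    // (Illustrative; variable names and the qNext values are made up.)
    public class TransitionExample {
        public static void main(String[] args) {
            double[] s      = {0, 0, 0,  0, 0, 0,  0, 0, 0};   // empty board, X to move
            int      a      = 0;                               // X plays the upper-left corner (index 0)
            double   r      = 0;                               // the move is not terminal -> reward 0
            double[] sPrime = {1, 0, -1, 0, 0, 0,  0, 0, 0};   // board after X's move and O's reply

            // Standard Q-learning target for a non-terminal transition:
            //   target = r + gamma * max_a' Q(sPrime, a')
            // (and just target = r when sPrime is terminal: win / loss / draw).
            double gamma = 0.9;
            double[] qNext = {0.1, 0.4, 0.2, 0.3, 0.5, 0.1, 0.2, 0.3, 0.1};  // pretend network output for sPrime
            double maxQ = Double.NEGATIVE_INFINITY;
            for (double q : qNext) maxQ = Math.max(maxQ, q);
            double target = r + gamma * maxQ;                  // 0 + 0.9 * 0.5 = 0.45
            System.out.println("target for (s, a) = " + target);
        }
    }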

Problem

All my Q-values go to zero if I give a -1 reward when a move is made to an already occupied spot. If I don't, the network doesn't learn that it shouldn't move to already occupied spots, and it seems to learn arbitrary Q-values. Also, my error doesn't seem to shrink.

Solutions that didn't work

  • I tried changing the rewards to (0, 0.5, 1) and (0, 1), but it still didn't learn.

  • I tried presenting the state as 0 for empty, 0.5 for O, and 1 for X, but that didn't work.

  • I tried giving the next state straight after the move is made, but it didn't help.

  • I tried both Adam and vanilla backprop, with the same results.

  • I tried batches from replay memory and stochastic gradient descent, but still the same.
  • I changed sigmoid to ReLU, but it didn't help.
  • All kinds of things I can't recall now.

Project on GitHub: https://github.com/Dopet/tic-tac-toe (Sorry for the ugly code, mostly due to all the refactoring; this was also supposed to be an easy test to see whether the algorithm works)

Main points:

  • The TicTac class contains the game itself (built with the template method pattern from the abstract Game class)
  • The NeuralNetwork class logs some data to a file called MyLogFile.log in the current directory
  • The Block and Combo classes are just used to create the winning situations
  • jblas-1.2.4.jar contains the DoubleMatrix libraries
Dope
  • This posting is excellent as far as it goes. The problem is (a) I don't see anything wrong with your approach; (b) you haven't provided code to reproduce the error. – Prune Nov 30 '16 at 23:44
  • I added the project to GitHub. Please ask if there is anything unclear! https://github.com/Dopet/tic-tac-toe – Dope Dec 01 '16 at 10:31
  • [Minimal, complete, verifiable example](http://stackoverflow.com/help/mcve) applies here. – Prune Dec 01 '16 at 16:57
  • There isn't really much I could remove from it. It only contains the tic-tac-toe game and my AI, each in a separate package. I also included tests in case someone is interested. The files where the problem could be: TicTac, which contains the game; NeuralNetwork, which contains the neural network; and NeuralQLearner, which uses the NeuralNetwork to provide Q-learning. – Dope Dec 02 '16 at 10:28
  • I see two directories and three top-level files, none of which appears to be a script to reproduce the problem. You haven't provided instructions to that end. Please understand that there are thousands of people in this community asking for help; if you fail to provide the *complete* and *verifiable* components, you severely reduce the set of willing helpers -- often to the null set. – Prune Dec 02 '16 at 17:46
  • Yeah, I know that. That's why I didn't provide any code at first: I thought no one would be interested enough to look through it. But the only thing needed to reproduce the problem is to make a Java project out of those files in some Java IDE, add the jblas library, and run the Main file, then check the contents of the log file. I just think doing that would produce no extra information I haven't already provided. And I'm at a complete dead end here. I just can't figure out why it doesn't learn, and I thought someone here might have a clue, like trying a different loss function or something like that. – Dope Dec 02 '16 at 23:43
  • Rats. No, my gut feeling is that the problem is more basic, somewhere in the configuration parameters. If you are actually getting into a proper forward-backward mode, you should be seeing something better than a flat refusal to learn. – Prune Dec 03 '16 at 00:02

2 Answers


It was a matter of the rewards / removing the activation function from the output layer. Most of the time I had rewards in [-1, 1] while my output layer activation was sigmoid, which ranges over [0, 1]. So the network always had an error whenever I rewarded it with -1, because the output can never be less than zero. This caused the values to go to zero, since the network kept trying to fix an error it couldn't fix.
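
To spell out the mismatch (an illustrative snippet, not the actual project code): a sigmoid output lives in (0, 1), so the squared error against a -1 target is always greater than 1, and gradient descent can only push the output toward 0:

    // Illustrative: a sigmoid output can never fit a -1 target.
    public class SigmoidTargetMismatch {
        static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

        public static void main(String[] args) {
            double target = -1.0;
            for (double z : new double[]{-10, -1, 0, 1, 10}) {
                double out = sigmoid(z);                 // always in (0, 1)
                double err = (out - target) * (out - target);
                System.out.printf("z=%5.1f  output=%.4f  squaredError=%.4f%n", z, out, err);
                // squaredError is > 1 for every z, so the best the network can do
                // is push the output toward 0 -- which is exactly what I observed.
            }
            // The fix: use a linear (or tanh) output layer so Q-values can reach [-1, 1].
        }
    }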

Dope

I think your formulation is wrong. You are updating the value of a state using the max value the NN gives for the next state.

    // DQN-style target: immediate reward plus the discounted best Q-value of the next state
    expectedValue[i] = replay.getReward() + gamma * targetNetwork.forwardPropagate(replay.getNextState()).max();

This works for single-player settings. But since tic-tac-toe is a two-player game, a higher value of the 'next state' (the opponent's position) is bad for the value of the current state.

You could take the max value two states forward (using the NN to predict two states ahead), but that doesn't work out well either, because you would be assuming that the second move you make is optimal, which results in lots of wrong updates.

I would recommend using policy gradients for settings like this, where how to propagate values is not very clear. In this approach you play random games (both players make random moves), and if, say, player 'O' wins, you reward all of player O's moves positively (scaled by a discount factor, i.e. the final move gets the most reward and earlier moves get geometrically less) and reward player X's moves negatively in the same fashion; see the sketch below. If the game ends in a draw, you can reward both players with a smaller positive reward.

You might end up rewarding suboptimal moves positively and vice versa, but over a large number of games things work out in your favor.
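
Roughly, the reward assignment I'm describing looks like this (an illustrative sketch only, with made-up names; not a full policy-gradient implementation):

    // Sketch of the discounted reward assignment described above.
    // (Illustrative only; names are made up, not from the question's repo.)
    public class DiscountedRewards {
        // One target per move of a single player, working backwards from the outcome.
        static double[] discountedRewards(int numMoves, double finalReward, double gamma) {
            double[] rewards = new double[numMoves];
            double r = finalReward;                  // e.g. +1 win, -1 loss, +0.3 draw
            for (int i = numMoves - 1; i >= 0; i--) {
                rewards[i] = r;                      // the final move gets the full reward,
                r *= gamma;                          // earlier moves get geometrically less
            }
            return rewards;
        }

        public static void main(String[] args) {
            // Winner made 4 moves, gamma = 0.9  ->  approximately [0.729, 0.81, 0.9, 1.0]
            System.out.println(java.util.Arrays.toString(discountedRewards(4, 1.0, 0.9)));
            // The loser's moves get the negated values, e.g. discountedRewards(4, -1.0, 0.9).
        }
    }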

Vadiraja K
  • This shouldn't be a problem, since my next state is the state after both my own and my opponent's move (see the Q-parameters section above for an example of what I mean). But I actually got this working: it was a matter of the rewards / removing the activation function from the output layer. With rewards in [-1, 1] and a sigmoid output bounded to [0, 1], the network always had an error it could never fix on the -1 targets, which drove the values to zero. – Dope Jan 18 '17 at 14:40