
I am toying around with machine learning, especially Q-learning, where you have states and actions and give rewards depending on how well the network performed.

For starters, I set myself a simple goal: train a network so that it emits valid tic-tac-toe moves (against a random opponent) as actions. My problem is that the network does not learn at all, or even gets worse over time.

The first thing I did was familiarize myself with Torch and a deep Q-learning module for it: https://github.com/blakeMilner/DeepQLearning .

Then I wrote a simple tic-tac-toe game in which a random player competes against the neural net, and plugged it into the code from this sample: https://github.com/blakeMilner/DeepQLearning/blob/master/test.lua . The output of the network consists of 9 nodes, one for marking each respective cell.

A move is valid if the network chooses an empty cell (no X or O in it). Accordingly, I give a positive reward if the network chooses an empty cell and a negative reward if it chooses an occupied cell.
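
Roughly, my move loop looks like this. This is a simplified sketch: the `Brain.forward` / `Brain.backward` calls follow the pattern from the test.lua sample (forward returns the chosen action index, backward takes the reward), and `encodeBoard` is just a placeholder for one of the input encodings listed below:

```lua
-- Simplified sketch of one network move. `board` is a table of 9 cells
-- (0 = empty, 1 = network, 2 = random opponent).
local function networkMove(board)
    local state  = encodeBoard(board)      -- 9 or 27 input values
    local action = Brain.forward(state)    -- chosen cell index, 1..9

    if board[action] == 0 then
        board[action] = 1                  -- valid move: mark the cell
        Brain.backward(1.0)                -- positive reward for choosing an empty cell
    else
        Brain.backward(-1.0)               -- negative reward for choosing an occupied cell
    end
end
```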

The problem is that it never seems to learn. I have tried lots of variations:

  • mapping the tic-tac-toe board as 9 inputs (0 = empty cell, 1 = player 1, 2 = player 2) or as 27 one-hot inputs (e.g. an empty cell becomes [empty = 1, player1 = 0, player2 = 0]); see the encoding sketch after this list
  • varying the hidden node count between 10 and 60
  • training for up to 60k iterations
  • varying the learning rate between 0.001 and 0.1
  • giving negative rewards for failures or only rewards for successes, with different reward values
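
For clarity, the two input encodings I mean look roughly like this (just a sketch of the input layout, not my exact code):

```lua
-- 9-input encoding: one number per cell (0 = empty, 1 = player 1, 2 = player 2)
local function encode9(board)
    local state = {}
    for i = 1, 9 do
        state[i] = board[i]
    end
    return state
end

-- 27-input one-hot encoding: three inputs per cell [empty, player1, player2]
local function encode27(board)
    local state = {}
    for i = 1, 9 do
        state[(i - 1) * 3 + 1] = (board[i] == 0) and 1 or 0  -- empty
        state[(i - 1) * 3 + 2] = (board[i] == 1) and 1 or 0  -- player 1
        state[(i - 1) * 3 + 3] = (board[i] == 2) and 1 or 0  -- player 2
    end
    return state
end
```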

Nothing works :(

Now I have a couple of questions:

  1. Since this is my very first attempt at Q-learning, is there anything I am fundamentally doing wrong?
  2. Which parameters are worth changing? The "Brain" module has a lot of them: https://github.com/blakeMilner/DeepQLearning/blob/master/deepqlearn.lua#L57 .
  3. What would be a good number of hidden nodes?
  4. Is the network structure defined at https://github.com/blakeMilner/DeepQLearning/blob/master/deepqlearn.lua#L116 too simple for this problem?
  5. Am I just too impatient and need to train for many more iterations?

Thank you,

-Matthias

nitrogenycs
  • Could you post your code? – John Wakefield Feb 08 '16 at 15:29
  • My suggestion: as a first step, forget about the neural network and stick to a tabular representation of the value of all tic-tac-toe states. Their number is `3^9 = 19683`. Using Q-learning, you get a few more entries due to the <= 9 actions per state, but this is still easily storable. Only after that would I move on, first to linear regression and then to a network. That way you can also compare the approximate results to the exact ones. – davidhigh Feb 12 '16 at 00:42
  • @John Wakefield: I am currently creating a simplified version of it (2x1 grid instead of 3x3 to minimize the state space). I'll send the code when it's done. – nitrogenycs Feb 14 '16 at 15:38
  • @davidhigh: I'll read up on linear regression (in the machine-learning sense). My main goal, however, is to have a small testbed for learning Q-learning and then applying it to a much harder problem I have in mind. – nitrogenycs Feb 14 '16 at 15:39

1 Answer


Matthias,

It seems you are using one output node ("The output of the network in the forward step is a number between 1 and 9")? If so, then I believe this is the problem. Instead of having one output node, I would treat this as a classification problem and have nine output nodes, one corresponding to each board position. Then take the argmax of these nodes as the predicted move. This is how networks that play the game of Go are set up (there are 361 output nodes, each representing an intersection on the board).
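
In other words, something along these lines (a plain Lua sketch, assuming the forward pass gives you a table of nine scores, one per cell):

```lua
-- Pick the board cell with the highest output score (argmax over 9 nodes).
local function predictedMove(outputs)
    local bestCell, bestScore = 1, outputs[1]
    for cell = 2, 9 do
        if outputs[cell] > bestScore then
            bestCell, bestScore = cell, outputs[cell]
        end
    end
    return bestCell
end
```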

Hope this helps!

islandman93
  • Thank you for your answer. I am actually using 9 output nodes, I didn't phrase this well in my original question. I've edited it now for clarity. – nitrogenycs Feb 14 '16 at 15:37