
I am trying to train a neural net to play Tic Tac Toe via reinforcement learning with Keras in Python. Currently the net gets the current board as input:

    array([0,1,0,-1,0,1,0,0,0])
where 1 = X, -1 = O, and 0 = an empty field.

If the net wins a game, it gets a reward for every action (output) it took, e.g. [0,0,0,0,1,0,0,0,0]. If the net loses, I want to train it with a bad reward, e.g. [0,0,0,0,-1,0,0,0,0].
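
To illustrate, something like this is roughly what I mean (a simplified sketch, not my exact code; the layer sizes are arbitrary):

    import numpy as np
    from tensorflow import keras

    # simplified sketch: a small dense net mapping a board (9 values) to 9 move scores
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(9,)),
        keras.layers.Dense(9, activation="tanh"),   # tanh so negative targets are possible
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])

    board = np.array([[0, 1, 0, -1, 0, 1, 0, 0, 0]])   # current position
    target = np.zeros((1, 9))
    target[0, 4] = 1      # reward the move that was played after a win ...
    # target[0, 4] = -1   # ... or punish it after a loss
    model.fit(board, target, verbose=0)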

But currently I get a lot of 0.000e-000 accuracies.

Can I train a "bad reward" at all? Or, if I can't do it with -1, how should I do it instead?

Thanks in advance.


1 Answer


You need to backpropagate the reward won at the end of the game. Have a look at this tutorial.

In short, from this tutorial:

    # at the end of the game, backpropagate the final reward and update the state values
    def feedReward(self, reward):
        # walk through the visited states from the last move back to the first
        for st in reversed(self.states):
            if self.states_value.get(st) is None:
                self.states_value[st] = 0
            # nudge the state value towards the decayed reward
            self.states_value[st] += self.lr * (self.decay_gamma * reward
                                                - self.states_value[st])
            # the updated value becomes the reward propagated to the previous state
            reward = self.states_value[st]

As you can see, the reward obtained at the last step (say step 5, the end of the game) is backpropagated (not in the derivative sense) through all the preceding steps (4, 3, 2, 1) with a decay rate. This is the way to go, because Tic Tac Toe is a game with a delayed reward, as opposed to classic reinforcement learning environments where we usually get a reward (positive or negative) at each step. Here the reward for the action at time T depends on the final action at T + something: that final action gives a reward of 1 if it ended the game with a win, or -1 if the opponent played the last action and won.
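
To make the decay concrete, here is a small standalone rerun of the same update rule (the states, lr and decay_gamma values are arbitrary examples, not taken from the question):

    # standalone illustration of the feedReward update above
    lr, decay_gamma = 0.2, 0.9
    states = ["s1", "s2", "s3", "s4", "s5"]   # states visited during one game, in order
    states_value = {}

    reward = 1                                # the agent won the game
    for st in reversed(states):
        if states_value.get(st) is None:
            states_value[st] = 0
        states_value[st] += lr * (decay_gamma * reward - states_value[st])
        reward = states_value[st]

    print(states_value)
    # approximately {'s5': 0.18, 's4': 0.0324, 's3': 0.0058, 's2': 0.0010, 's1': 0.0002}:
    # the last move gets most of the credit, earlier moves progressively less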

As for the accuracy, we don't use it as a metric in reinforcement learning. A good metric is the mean cumulative reward (which will be around 0 if your agent wins half of the time, > 0 if it has learned something, and < 0 otherwise).
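
For instance, something along these lines, where play_one_game() is a hypothetical placeholder that runs one training game and returns its final reward (1 win, -1 loss, 0 draw):

    # sketch of tracking the mean cumulative reward; play_one_game() is a placeholder
    rewards = []
    for episode in range(10000):
        rewards.append(play_one_game())
        if (episode + 1) % 500 == 0:
            window = rewards[-500:]
            print(f"episode {episode + 1}: mean reward over last 500 games = "
                  f"{sum(window) / len(window):+.3f}")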

  • So, if the game is won, I train the actions with rewards of, let's say, 1, 0.8, 0.6, ..., from last to first? – nailuj05 Jan 05 '20 at 12:19
  • That's the idea :) and the amount propagated through past actions will depend on the decay_rate. It makes sense: the further you go back in time, the less an action is determining, and the less it is responsible for what happens at T + N. Following this idea, its contribution is smaller than the last action's, and so is its reward. – Dany Yatim Jan 05 '20 at 13:43
  • Would you decay the negative reward too? And what about the first action? As far as I know there is no "bad" first action. Would you still reward it? – nailuj05 Jan 06 '20 at 10:30
  • Yes, same logic for negative rewards. As for the first action, it depends on how you consider your game: you could easily not assign any reward, so that your agent keeps playing randomly at step 0. However, if your opponent is, for some reason, statistically better when your action at step 0 is, let's say, (2,2), it could be better to propagate the reward back to the first action, so that your agent won't play this action at step 0 anymore. In the end, if the first action is not determining and you still propagate the reward back, all step-0/action pairs will converge to the same value. – Dany Yatim Jan 06 '20 at 11:27
  • Would you still recommend some sort of epsilon function for randomizing? If yes, what kind of epsilon function? P.S. Thanks for all the advice. – nailuj05 Jan 06 '20 at 12:41
  • You are welcome :) happy to help. Of course, the epsilon function is there for exploration, which is important in reinforcement learning to assess a value for each state/action pair. The more you explore, the more complete a view you get of how good action a is in state x. You could use a decaying epsilon (starting value around 0.9), e.g. eps *= 0.99 with max(eps, 0.1) (lots of exploration at the beginning, and a constant low epsilon at the end); see the sketch below. If your algorithm gets stuck early, try lowering the decay rate to let it explore longer. – Dany Yatim Jan 06 '20 at 13:28
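
For illustration, a minimal sketch of the decaying epsilon-greedy selection described in the last comment (q_values and valid_moves are placeholders for your own agent's data; the starting and minimum values are just the examples from the comment):

    import random

    # minimal sketch of decaying epsilon-greedy action selection;
    # q_values (a dict of move -> value) and valid_moves are placeholders
    eps, eps_decay, eps_min = 0.9, 0.99, 0.1

    def choose_action(valid_moves, q_values):
        global eps
        if random.random() < eps:
            action = random.choice(valid_moves)                   # explore: random legal move
        else:
            action = max(valid_moves, key=lambda m: q_values[m])  # exploit: best known move
        eps = max(eps * eps_decay, eps_min)                       # decay towards the 0.1 floor
        return action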