
I can't understand how to update the Q values for the tic-tac-toe game. I have read a lot about it, but I can't picture how to do it. I read that the Q value is updated at the end of the game, but I don't understand how that works if there is a Q value for each action.

bzkrtmurat

2 Answers


You have a Q value for each state-action pair. You update one Q value after every action you perform. More precisely, if applying action a1 from state s1 gets you into state s2 and brings you some reward r, then you update Q(s1, a1) as follows:

Q(s1, a1) = Q(s1, a1) + learning_rate * (r + discount_factor * max Q(s2, _) - Q(s1, a1))

In many games, such as tic-tac-toe, you don't get rewards until the end of the game; that's why you have to run the algorithm through several episodes. That's how information about the utility of final states is propagated to other states.
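A minimal tabular sketch of that update, assuming the board state is encoded as something hashable (e.g. a string) and the Q table is a plain dictionary; the names below (`update_q`, `ALPHA`, `GAMMA`) are illustrative, not part of any particular library:

```python
from collections import defaultdict

Q = defaultdict(float)   # one value per (state, action) pair, default 0.0

ALPHA = 0.1   # learning_rate
GAMMA = 0.9   # discount_factor

def update_q(s1, a1, reward, s2, legal_actions_s2):
    """One-step Q-Learning update after moving from state s1 to s2 via action a1."""
    # max Q(s2, _); a terminal state has no legal actions, so the max defaults to 0.
    best_next = max((Q[(s2, a)] for a in legal_actions_s2), default=0.0)
    Q[(s1, a1)] += ALPHA * (reward + GAMMA * best_next - Q[(s1, a1)])
```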

Tudor Berariu
  • Thank you for your reply. But I still can't understand Q-Learning for tic-tac-toe. You said that you don't get rewards until the end of the game; I got that. But I can't understand how the machine decides its first action. For example, I put "X" and the machine will put "O". How does the machine decide where to put this "O", since I understood that there is only one Q value for the complete game? – bzkrtmurat Jan 19 '15 at 10:26
  • Tic-tac-toe is a two-player game. When learning using Q-Learning you need an opponent to play against while learning. That means that you need to implement another algorithm (e.g. Minimax), play yourself, or use another reinforcement learning agent (which might be the same Q-Learning algorithm). – Tudor Berariu Jan 19 '15 at 10:29
  • To decide which action to take in a particular state, you need a policy. A common option when implementing Q-Learning is to use epsilon-greedy (with a decaying epsilon), which considers the trade-off between exploration and exploitation. – Tudor Berariu Jan 19 '15 at 10:32
  • No, in the SARSA algorithm you don't take the maximum value for `Q` in `s2`. In SARSA you choose an action `a2` using the policy and then you update `Q(s1, a1)` considering `Q(s2, a2)` instead of `max(Q(s2, _))`. – Tudor Berariu Feb 18 '15 at 08:54
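Turning the two comments above into code: a sketch of an epsilon-greedy policy and of the SARSA-style update, using the same kind of illustrative dictionary-based Q table as before (the decay schedule at the end is just an assumed example, not something prescribed here):

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # illustrative Q table, as in the earlier sketch
ALPHA, GAMMA = 0.1, 0.9

def epsilon_greedy(state, legal_actions, epsilon):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: Q[(state, a)])

def sarsa_update(s1, a1, reward, s2, a2):
    """SARSA uses the action a2 actually chosen by the policy in s2,
    not max Q(s2, _) as Q-Learning does."""
    target = 0.0 if a2 is None else Q[(s2, a2)]   # a2 is None when s2 is terminal
    Q[(s1, a1)] += ALPHA * (reward + GAMMA * target - Q[(s1, a1)])

# A decaying epsilon, e.g. epsilon = max(0.05, 0.995 ** episode), favours
# exploration early in training and exploitation later on.
```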

The problem with the standard Q-Learning algorithm is that it simply takes too long to propagate the values from the final move back to the first one, because you only know the outcome of the game at the end.

Therefore the Q-Learning algorithm should be modified. The paper cited below gives some details on possible modifications:

  1. a non-negative reward is given after the game ends (except for a draw); consequently, the Q update is not performed at every action step (which would change nothing), but only once the game has ended
  2. the Q update is performed by propagating the new value from the last move backward to the first move (a rough sketch of this backward pass follows the list)
  3. another update formula is incorporated that also considers the opponent's point of view, because of the turn-taking nature of a two-player game
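As a rough illustration of items 1 and 2 only (the modified update rule of item 3 is given in the paper and is not reproduced here), one could store the agent's moves during an episode and sweep backward once the outcome is known; the helper below and its names are assumptions for this sketch:

```python
from collections import defaultdict

Q = defaultdict(float)   # illustrative Q table keyed by (state, action)
ALPHA, GAMMA = 0.1, 0.9

def update_at_game_end(episode, final_reward):
    """episode: the agent's (state, action) pairs for one game, in move order.
    The reward is only known when the game terminates (item 1), and the update
    sweeps from the last move backward to the first (item 2)."""
    target = final_reward
    for state, action in reversed(episode):
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        # The freshly updated value becomes the (discounted) target for the
        # previous move, so the outcome propagates back in a single pass.
        target = GAMMA * Q[(state, action)]
```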

Abstract:

This paper reports our experiment on applying Q Learning algorithm for learning to play Tic-tac-toe. The original algorithm is modified by updating the Q value only when the game terminates, propagating the update process from the final move backward to the first move, and incorporating a new update rule. We evaluate the agent performance using full-board and partial-board representations. In this evaluation, the agent plays the tic-tac-toe game against human players. The evaluation results show that the performance of modified Q Learning algorithm with partial-board representation is comparable to that of human players.

Learning to Play Tic-Tac-Toe (2009) by Dwi H. Widyantoro & Yus G. Vembrina

(Unfortunately it is behind a paywall. Either you have access to the IEEE archive or you can ask the authors to provide a copy on researchgate: https://www.researchgate.net/publication/251899151_Learning_to_play_Tic-tac-toe)

vonjd