I am trying to implement a Q-learning agent that learns an optimal policy for playing Tic Tac Toe against a random agent.
I have a plan that I believe will work, but there is one part I cannot get my head around. It comes from the fact that there are two players within the environment.
Now, a Q-learning agent should act upon the current state, s, the action taken given some policy, a, the successive state given the action, s', and any reward received from that successive state, r.
Let's put this into a tuple: (s, a, r, s').
Now, usually an agent will act in every state it encounters, and use the Q-learning equation to update the value of the previous state-action pair.
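For reference, the update I have in mind is the standard tabular Q-learning rule, Q(s, a) ← Q(s, a) + α [r + γ max over a' of Q(s', a') − Q(s, a)]. Below is a minimal sketch of it. The names `q_update`, `ALPHA` and `GAMMA`, the dictionary-based table, and the particular parameter values are my own assumptions for illustration, not part of any library.

```python
from collections import defaultdict

ALPHA = 0.5  # learning rate (assumed value, for illustration only)
GAMMA = 0.9  # discount factor (assumed value, for illustration only)

# Q-table mapping (state, action) -> value; unseen pairs default to 0.0
Q = defaultdict(float)


def q_update(Q, s, a, r, s_next, next_actions, alpha=ALPHA, gamma=GAMMA):
    """Apply one tabular Q-learning update for the transition (s, a, r, s')."""
    # Value of the best action available in s'; 0.0 if s' is terminal
    # (i.e. next_actions is empty).
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    # Move Q(s, a) towards the bootstrapped target r + gamma * best_next.
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]
```

Here `next_actions` is the list of legal actions in s'. My question is essentially about which state should count as s' when the opponent moves in between.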
However, as Tic Tac Toe has two players, we can partition the set of states into two. One set contains the states where it is the learning agent's turn to act. The other set contains the states where it is the opponent's turn to act.
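To make the partition concrete, here is a rough sketch. It assumes the board is a tuple of nine cells holding "X", "O" or " ", that X always moves first, and that the learning agent plays X. All of these are my own assumptions, and the function names are hypothetical.

```python
def player_to_move(board):
    """Return the mark ('X' or 'O') of the player whose turn it is.

    Assumes X moves first, so X is to move whenever both marks
    appear equally often on the board.
    """
    xs = board.count("X")
    os_ = board.count("O")
    return "X" if xs == os_ else "O"


def is_agent_turn(board, agent_mark="X"):
    """True if the board belongs to the set of states where the agent acts."""
    return player_to_move(board) == agent_mark


empty = (" ",) * 9
after_x = ("X",) + (" ",) * 8  # X has played the top-left cell

print(is_agent_turn(empty))    # agent (X) is to move
print(is_agent_turn(after_x))  # opponent (O) is to move
```

This sketch does not check whether the game is already over; it only splits boards by whose turn it would be.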
So, do we need to partition the states into two? Or does the learning agent need to update every single state that is accessed within the game?
I feel as though it should probably be the latter, as partitioning might affect how Q-values are updated when the opponent wins the game.
Any help with this would be great, as there does not seem to be anything online that helps with my predicament.