I'm trying to figure out how to implement Q-learning in a gridworld example. I believe I understand the basics of how Q-learning works, but my implementation doesn't seem to be giving me the correct values.
This example is from Sutton and Barto's book on reinforcement learning.
The gridworld is specified so that, in every state, the agent chooses among the actions {N, E, W, S} with equal probability. The reward for every action is 0, except when the agent attempts to move off the grid, in which case it is -1. There are two special states, A and B: from A the agent moves deterministically to A' with reward +10, and from B it moves deterministically to B' with reward +5.
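To make the setup concrete, here is a minimal sketch of the environment as I understand it. The 5x5 size, the coordinates of A, A', B and B', and the rule that an off-grid move leaves the agent where it is are assumptions taken from my reading of the book's example; the names `step`, `ACTIONS` and so on are just ones I made up.

```python
# Assumed layout: 5x5 grid, (row, col) coordinates, row 0 at the top.
N = 5
A, A_PRIME = (0, 1), (4, 1)   # assumed positions of A and A'
B, B_PRIME = (0, 3), (2, 3)   # assumed positions of B and B'
ACTIONS = {"N": (-1, 0), "E": (0, 1), "W": (0, -1), "S": (1, 0)}


def step(state, action):
    """Return (next_state, reward) for taking `action` in `state`."""
    # Special states: every action leads to the primed state.
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    # Attempted move off the grid: stay in place (assumption), reward -1.
    return state, -1.0
```

For example, `step((0, 0), "N")` keeps the agent in the corner with reward -1, while any action from A jumps to A' with reward +10.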
My question is about how to implement this with Q-learning. I want to be able to estimate the value function through matrix inversion. The agent starts in some initial state knowing nothing about the environment, selects actions with an epsilon-greedy rule, and receives rewards that we can simulate, since we know how the rewards are distributed.
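For reference, this is roughly the tabular Q-learning loop I have in mind, written as a self-contained sketch. The grid layout is the same assumed 5x5 layout from the book, and the step size `alpha`, discount `gamma`, exploration rate `epsilon`, start state and number of steps are all values I picked for illustration, not anything prescribed by the example.

```python
import numpy as np

N = 5                                        # assumed 5x5 grid
A, A_PRIME = (0, 1), (4, 1)                  # assumed positions
B, B_PRIME = (0, 3), (2, 3)
MOVES = [(-1, 0), (0, 1), (0, -1), (1, 0)]   # N, E, W, S


def step(state, a):
    """Simulate one move; off-grid moves stay in place with reward -1."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    r, c = state[0] + MOVES[a][0], state[1] + MOVES[a][1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0


def q_learning(n_steps=20000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N * N, len(MOVES)))        # one row per state
    state = (0, 0)                           # arbitrary start state
    for _ in range(n_steps):
        s = state[0] * N + state[1]
        # Epsilon-greedy action selection.
        if rng.random() < epsilon:
            a = int(rng.integers(len(MOVES)))
        else:
            a = int(np.argmax(Q[s]))
        next_state, reward = step(state, a)
        s_next = next_state[0] * N + next_state[1]
        # Q-learning update: bootstrap from the best action in the next state.
        Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
        state = next_state
    return Q


Q = q_learning()
print(Q.max(axis=1).reshape(N, N).round(1))
```

One thing I noticed while writing this: the update bootstraps from the greedy action in the next state, so the values it learns are for the greedy policy and need not match values computed for the equiprobable random policy.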
This leads me to my question: can I build a transition probability matrix incrementally, updating it each time the agent transitions from some state S to S', with the probabilities estimated from the frequency with which the agent took a particular action and made a particular transition?
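Here is a sketch of what I mean by the count-based approach, so it is clear what I am asking about. It counts observed S -> S' transitions while following the equiprobable random policy (an assumption on my part; with epsilon-greedy selection the counts would reflect that policy instead), normalises the counts into an estimated matrix `P_hat`, averages the observed rewards into `r_hat`, and then solves v = (I - gamma * P_hat)^(-1) r_hat. The grid layout, `gamma = 0.9`, and all function names are my own choices for illustration.

```python
import numpy as np

N = 5                                        # assumed 5x5 grid
A, A_PRIME = (0, 1), (4, 1)                  # assumed positions
B, B_PRIME = (0, 3), (2, 3)
MOVES = [(-1, 0), (0, 1), (0, -1), (1, 0)]   # N, E, W, S


def step(state, a):
    """Simulate one move; off-grid moves stay in place with reward -1."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    r, c = state[0] + MOVES[a][0], state[1] + MOVES[a][1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0


def estimate_model(n_steps=50000, seed=0):
    """Estimate state-to-state transition frequencies and mean rewards."""
    rng = np.random.default_rng(seed)
    n_states = N * N
    counts = np.zeros((n_states, n_states))
    reward_sum = np.zeros(n_states)
    state = (0, 0)
    for _ in range(n_steps):
        a = int(rng.integers(len(MOVES)))    # equiprobable random policy
        next_state, reward = step(state, a)
        s = state[0] * N + state[1]
        s_next = next_state[0] * N + next_state[1]
        counts[s, s_next] += 1
        reward_sum[s] += reward
        state = next_state
    visits = counts.sum(axis=1)
    safe = np.maximum(visits, 1)             # avoid dividing by zero
    return counts / safe[:, None], reward_sum / safe


def evaluate(P_hat, r_hat, gamma=0.9):
    """Solve (I - gamma * P_hat) v = r_hat instead of forming the inverse."""
    n = len(r_hat)
    return np.linalg.solve(np.eye(n) - gamma * P_hat, r_hat)


P_hat, r_hat = estimate_model()
v = evaluate(P_hat, r_hat)
print(v.reshape(N, N).round(1))
```

I used `np.linalg.solve` rather than an explicit inverse because it gives the same v with one fewer step. Because the counts are aggregated over actions, `P_hat` describes the state transitions under the policy that generated the data, so v is an estimate of that policy's value function.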