
I'm trying to figure out how to implement Q-learning in a gridworld example. I believe I understand the basics of how Q-learning works, but it doesn't seem to be giving me the correct values.

This example is from Sutton and Barto's book on reinforcement learning.

The gridworld is specified such that the agent can take the actions {N, E, W, S} in any state with equal probability. The reward for every action is 0, except when the agent attempts to move off the grid, in which case the reward is -1 and the agent's position is unchanged. There are two special states, A and B, from which every action deterministically moves the agent to A' and B', with rewards +10 and +5 respectively.
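
For concreteness, here is a minimal Python sketch of how I picture simulating these dynamics (the coordinates of A, A', B, B' follow Example 3.5 of the book; the `step` function name and the (row, col) state encoding are just my own assumptions):

```python
# Minimal sketch of the 5x5 gridworld dynamics (Sutton & Barto, Example 3.5).
# States are (row, col) tuples; the four actions are N, E, W, S.
GRID_SIZE = 5
ACTIONS = {"N": (-1, 0), "E": (0, 1), "W": (0, -1), "S": (1, 0)}

A, A_PRIME = (0, 1), (4, 1)   # from A, every action leads to A' with reward +10
B, B_PRIME = (0, 3), (2, 3)   # from B, every action leads to B' with reward +5

def step(state, action):
    """Return (next_state, reward) for taking `action` in `state`."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    dr, dc = ACTIONS[action]
    row, col = state[0] + dr, state[1] + dc
    if 0 <= row < GRID_SIZE and 0 <= col < GRID_SIZE:
        return (row, col), 0.0
    return state, -1.0         # off-grid move: stay put, reward -1
```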

My question is about how I would go about implementing this through Q-learning. I want to be able to estimate the value function through matrix inversion. The agent starts out in some initial state, not knowing anything, and then takes actions selected by an epsilon-greedy policy and receives rewards that we can simulate, since we know how the rewards are distributed.
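
By matrix inversion I mean solving the Bellman equations for the policy directly, something like the sketch below, where `P_pi` and `r_pi` are placeholders for quantities I would somehow have to estimate:

```python
import numpy as np

# Hypothetical sketch: policy evaluation by matrix inversion. P_pi[s, s'] are
# state-to-state transition probabilities under the policy and r_pi[s] is the
# expected one-step reward in each state; both are placeholder inputs here.
def evaluate_policy(P_pi, r_pi, gamma=0.9):
    """Solve v = r_pi + gamma * P_pi @ v, i.e. v = (I - gamma * P_pi)^-1 r_pi."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```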

This leads me to my question: can I build up a transition probability matrix as the agent transitions from some state S -> S', with the probabilities computed from the frequency with which the agent took a particular action and made a particular transition?
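
Concretely, I imagine keeping counts and normalizing them, something like this sketch (the flat state/action indexing and the helper names are purely illustrative):

```python
import numpy as np

# Hypothetical sketch: estimating P[s, a, s'] from observed transition counts.
n_states, n_actions = 25, 4
counts = np.zeros((n_states, n_actions, n_states))

def record(s, a, s_next):
    """Log one observed transition (s, a) -> s_next."""
    counts[s, a, s_next] += 1

def estimated_P():
    """Normalize counts into empirical transition probabilities P[s, a, s']."""
    totals = counts.sum(axis=2, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(totals > 0, counts / totals, 0.0)
```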

user3425451

2 Answers


For Q-learning you don't need a "model" of the environment (i.e., a transition probability matrix) to estimate the value function, because it is a model-free method. Evaluation by matrix inversion belongs to dynamic programming (model-based), where you solve the Bellman equations using a known transition matrix. You can think of the Q-learning algorithm as a kind of trial-and-error algorithm: you select an action and receive feedback from the environment. However, contrary to model-based methods, you don't have any knowledge of how your environment works (no transition matrix and no reward matrix). Eventually, after enough sampled experience, the Q function will converge to the optimal one.

For the implementation of the algorithm, you can start from an initial state after initializing your Q function for all states and actions (so you keep track of an $|S| \times |A|$ table). Then you select an action according to your policy. Here you should implement a step function. The step function will return the new state $s'$ and the reward. Consider the step function as the environment's feedback to your action.

Eventually you just need to update your Q function according to $Q(s,a) \leftarrow Q(s,a)+\alpha\left[r+\gamma\max_{a'}Q(s',a')-Q(s,a)\right]$. Set $s=s'$ and repeat the whole process until convergence.
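
To make this concrete, here is a minimal Python sketch of such a loop (the function name, hyperparameter values and episode lengths are arbitrary choices of mine; `step` is whatever environment feedback function you implement, e.g. the gridworld sketch in the question):

```python
import numpy as np

def q_learning(step, states, actions, episodes=5000, steps_per_episode=100,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    `step(state, action) -> (next_state, reward)` is the environment's
    feedback to the chosen action.
    """
    rng = np.random.default_rng(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}  # the |S| x |A| table

    def epsilon_greedy(s):
        if rng.random() < epsilon:                      # explore
            return actions[rng.integers(len(actions))]
        return max(actions, key=lambda a: Q[(s, a)])    # exploit

    for _ in range(episodes):
        s = states[rng.integers(len(states))]           # arbitrary start state
        for _ in range(steps_per_episode):
            a = epsilon_greedy(s)
            s_next, r = step(s, a)
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])
            s = s_next
    return Q
```

With the gridworld sketch from the question this could be called as, for example, `Q = q_learning(step, states=[(r, c) for r in range(5) for c in range(5)], actions=["N", "E", "W", "S"])`.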

Hope this helps

Constantinos

Not sure if this helps, but here is a write-up explaining Q-learning through an example of a robot. There is also some R code in there if you want to try it out yourself.

broccoli