
Consider a simple 3x4 GridWorld with a reward of -0.04 in each non-terminal state:

[ ][ ][ ][+1]
[ ][W][ ][-1]
[ ][ ][ ][  ]

where W is a wall and +1/-1 are terminal states. The agent can move in any direction, but it only succeeds in moving in the intended direction 80% of the time; 10% of the time it slips right (relative to the intended direction) and 10% of the time it slips left.
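
To make the transition model concrete, here is a rough Python sketch of how I picture it (the coordinate convention, constants, and helper names are just mine for illustration):

```python
# Rough sketch of the 3x4 GridWorld above. Coordinates are (row, col) with
# (0, 0) in the top-left corner; the names are just for illustration.
ROWS, COLS = 3, 4
WALL = (1, 1)
TERMINALS = {(0, 3): +1.0, (1, 3): -1.0}
STEP_REWARD = -0.04

# Direction vectors and the perpendicular "slip" directions.
MOVES = {'U': (-1, 0), 'R': (0, 1), 'D': (1, 0), 'L': (0, -1)}
LEFT_OF = {'U': 'L', 'L': 'D', 'D': 'R', 'R': 'U'}
RIGHT_OF = {'U': 'R', 'R': 'D', 'D': 'L', 'L': 'U'}

def step(state, direction):
    """Deterministic move; bumping into the wall or the border stays put."""
    r, c = state
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < ROWS and 0 <= nc < COLS) or (nr, nc) == WALL:
        return state
    return (nr, nc)

def transitions(state, action):
    """Stochastic outcomes: 80% intended direction, 10% slip left, 10% slip right."""
    return [(0.8, step(state, action)),
            (0.1, step(state, LEFT_OF[action])),
            (0.1, step(state, RIGHT_OF[action]))]
```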

In the Policy Iteration algorithm, we first generate a random policy; let's say this policy gets generated:

[L][L][L][+1]
[L][W][L][-1]
[L][L][L][L ]

where L means left.
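
In code, that policy is just a map from every non-terminal, non-wall state to left (reusing the grid definitions from the sketch above):

```python
# The all-left policy from the figure, defined on every non-terminal,
# non-wall state (reuses ROWS, COLS, WALL, TERMINALS from the sketch above).
policy = {(r, c): 'L'
          for r in range(ROWS) for c in range(COLS)
          if (r, c) != WALL and (r, c) not in TERMINALS}
```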

Now we run the modified value iteration algorithm until the values in consecutive iterations no longer differ much.

We initialize the values to 0 (except for the terminal states):

[0][0][0][+1]
[0][W][0][-1]
[0][0][0][0 ]
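
My understanding of that evaluation step, again as a rough sketch reusing the definitions above (the threshold and sweep cap are arbitrary, and there is no discount factor because the formula below doesn't have one):

```python
def evaluate_policy(policy, theta=1e-4, max_sweeps=1000):
    """Repeat the backup V(s) = Reward + sum_s' P(s' | s, policy(s)) * V(s')
    until the largest change in one sweep drops below theta (or the sweep
    cap is hit). Reuses TERMINALS, STEP_REWARD and transitions() from above."""
    V = {s: 0.0 for s in policy}   # non-terminal values start at 0
    V.update(TERMINALS)            # terminal values stay fixed at +1 / -1
    for _ in range(max_sweeps):
        delta = 0.0
        for s, a in policy.items():
            new_v = STEP_REWARD + sum(p * V[s2] for p, s2 in transitions(s, a))
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            break
    return V
```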

But here's what I don't get:

Since we use the formula

newValue = Reward + 0.8*previousValueFromForwardState + 0.1*previousValueFromLeftState + 0.1*previousValueFromRightState

to fill in the new values, whatever lies behind the policy direction in a state can never change the value of that cell. For example, the cell directly to the left of +1 only ever looks at the cells to its left, above, and below it, never at +1 itself. Since only the terminal states +1 and -1 can get the value iteration going, and under this policy they are always ignored,

wouldn't that just create an infinite loop?

With each iteration we would just keep getting multiples of 0.04, so the differences between iterations would stay constant (except for the lower-right corner, but it won't influence anything...).
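
A quick experiment that I think shows this, reusing the definitions from the earlier sketches (synchronous sweeps, no discount):

```python
# Run a few synchronous sweeps under the all-left policy and print the
# largest per-sweep change; it settles around 0.04 instead of shrinking.
V = {s: 0.0 for s in policy}
V.update(TERMINALS)
for sweep in range(1, 11):
    new_V = dict(V)
    for s, a in policy.items():
        new_V[s] = STEP_REWARD + sum(p * V[s2] for p, s2 in transitions(s, a))
    delta = max(abs(new_V[s] - V[s]) for s in policy)
    V = new_V
    print(f"sweep {sweep}: max change = {delta:.4f}")
```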

