For an assignment, I am supposed to build an MDP agent that uses policy iteration and value iteration, and compare their performance using the utility values of the states.
Given that an MDP agent knows the transition probabilities and rewards, how does it know which action to take?
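Concretely, my mental model of action selection is a one-step lookahead over the known model. Here is a rough Python sketch of what I mean; the representation (`transitions[(s, a)]` as a list of `(probability, next_state, reward)` triples, utilities `U`, discount `gamma`) is my own assumption, not something from the assignment:

```python
def best_action(state, actions, transitions, U, gamma=0.9):
    """Pick the action with the highest expected one-step value.

    transitions[(s, a)] is assumed to be a list of
    (probability, next_state, reward) triples, and U maps
    each state to its current utility estimate.
    """
    def q_value(a):
        # Expected immediate reward plus discounted utility of the successor.
        return sum(p * (r + gamma * U[s2])
                   for p, s2, r in transitions[(state, a)])

    return max(actions, key=q_value)
```

Is this greedy lookahead essentially how the agent decides where to move?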
From my understanding, an MDP agent performs policy iteration: given a policy, it calculates the rewards accumulated until it reaches the terminal state, and the policy itself is derived from the value iteration algorithm.
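For reference, this is what I understand value iteration to look like, using the same assumed representation as above (I also assume terminal states self-loop with zero reward, so the `max` is always defined):

```python
def value_iteration(states, actions, transitions, gamma=0.9, tol=1e-6):
    """Sweep Bellman optimality updates until the utilities stop
    changing, then read off the greedy policy from them."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality update: best expected value over actions.
            new_u = max(
                sum(p * (r + gamma * U[s2])
                    for p, s2, r in transitions[(s, a)])
                for a in actions)
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:
            break
    # Extract the policy that acts greedily with respect to U.
    policy = {
        s: max(actions, key=lambda a: sum(
            p * (r + gamma * U[s2])
            for p, s2, r in transitions[(s, a)]))
        for s in states}
    return U, policy
```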
Can someone provide some intuition for how policy iteration works?
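My current, possibly wrong, picture of policy iteration is sketched below: alternate between evaluating the current policy (computing its utilities) and improving it greedily, until the policy stops changing. Again, the representation is my own assumption:

```python
def policy_iteration(states, actions, transitions, gamma=0.9, tol=1e-6):
    """Alternate policy evaluation and greedy improvement until stable."""
    policy = {s: actions[0] for s in states}  # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the Bellman equation for the
        # FIXED current policy until the utilities settle.
        while True:
            delta = 0.0
            for s in states:
                u = sum(p * (r + gamma * U[s2])
                        for p, s2, r in transitions[(s, policy[s])])
                delta = max(delta, abs(u - U[s]))
                U[s] = u
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to U.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: sum(
                p * (r + gamma * U[s2])
                for p, s2, r in transitions[(s, a)]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, U
```

Is this the right picture, or am I conflating the roles of policy iteration and value iteration?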