For an assignment, I am supposed to build an MDP agent that uses policy iteration and value iteration, and compare their performance using the utility values of the states.
Given that an MDP agent knows the transition probabilities and rewards, how does it know which action to take?
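Concretely, my mental model of action selection is a one-step lookahead over the known model. Here is a rough Python sketch of what I mean; the representation (`transitions[(s, a)]` as a list of `(probability, next_state, reward)` triples, utilities `U`, discount `gamma`) is my own assumption, not something from the assignment:

```python
def best_action(state, actions, transitions, U, gamma=0.9):
    """Pick the action with the highest expected one-step value.

    transitions[(s, a)] is assumed to be a list of
    (probability, next_state, reward) triples, and U maps
    each state to its current utility estimate.
    """
    def q_value(a):
        # Expected immediate reward plus discounted utility of the successor.
        return sum(p * (r + gamma * U[s2])
                   for p, s2, r in transitions[(state, a)])

    return max(actions, key=q_value)
```

Is this greedy lookahead essentially how the agent decides where to move?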
From my understanding, an MDP agent performs policy iteration: given a policy, it calculates the rewards accumulated until it reaches the terminal state, and the policy itself is derived from the value iteration algorithm.
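For reference, this is what I understand value iteration to look like, using the same assumed representation as above (I also assume terminal states self-loop with zero reward, so the `max` is always defined):

```python
def value_iteration(states, actions, transitions, gamma=0.9, tol=1e-6):
    """Sweep Bellman optimality updates until the utilities stop
    changing, then read off the greedy policy from them."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality update: best expected value over actions.
            new_u = max(
                sum(p * (r + gamma * U[s2])
                    for p, s2, r in transitions[(s, a)])
                for a in actions)
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < tol:
            break
    # Extract the policy that acts greedily with respect to U.
    policy = {
        s: max(actions, key=lambda a: sum(
            p * (r + gamma * U[s2])
            for p, s2, r in transitions[(s, a)]))
        for s in states}
    return U, policy
```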
Can someone provide some intuition for how policy iteration works?
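My current, possibly wrong, picture of policy iteration is sketched below: alternate between evaluating the current policy (computing its utilities) and improving it greedily, until the policy stops changing. Again, the representation is my own assumption:

```python
def policy_iteration(states, actions, transitions, gamma=0.9, tol=1e-6):
    """Alternate policy evaluation and greedy improvement until stable."""
    policy = {s: actions[0] for s in states}  # arbitrary initial policy
    U = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the Bellman equation for the
        # FIXED current policy until the utilities settle.
        while True:
            delta = 0.0
            for s in states:
                u = sum(p * (r + gamma * U[s2])
                        for p, s2, r in transitions[(s, policy[s])])
                delta = max(delta, abs(u - U[s]))
                U[s] = u
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to U.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: sum(
                p * (r + gamma * U[s2])
                for p, s2, r in transitions[(s, a)]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, U
```

Is this the right picture, or am I conflating the roles of policy iteration and value iteration?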