You're on the right track with those equations. You just need to consider each of the four possible policies in turn: (slow, slow), (fast, slow), (slow, fast), (fast, fast), where the first entry of each pair is the action taken in the cool state and the second is the action taken in the warm state.
Consider (slow, fast):
From a) you have already seen J*(cool) = 40.
J*(warm) = 10 + 0.9 * (0.875 * J*(warm) + 0.125 * J*(off))
J*(warm) = 10 + 0.9 * (0.875 * J*(warm) + 0.125 * 0)
J*(warm) = 47.06
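As a quick sanity check (a minimal Python sketch, using the rewards and transition probabilities assumed in the equations above), each of these single-unknown equations has the form J = r + gamma * p * J, which rearranges to J = r / (1 - gamma * p):

```python
gamma = 0.9

# Slow in cool: reward 4, stay in cool with probability 1.
J_cool = 4 / (1 - gamma * 1.0)      # 40.0

# Fast in warm: reward 10, stay in warm with probability 0.875;
# the remaining 0.125 goes to the terminal "off" state, which is worth 0.
J_warm = 10 / (1 - gamma * 0.875)   # about 47.06

print(J_cool, J_warm)
```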
For (slow, slow):
Again, J*(cool) is independent of your action in the warm state (taking the slow action in cool means you never leave cool), so J*(cool) = 40.
J*(warm) = 4 + 0.9 * (0.5 * J*(cool) + 0.5 * J*(warm))
J*(warm) = 4 + 0.9 * (0.5 * 40 + 0.5 * J*(warm))
J*(warm) = 40
And for (fast, fast):
This time the value of being in the warm state is independent of the cool action (the fast action in warm never takes you back to cool), so J*(warm) = 47.06, from above.
J*(cool) = 10 + 0.9 * (0.25 * J*(cool) + 0.75 * J*(warm))
J*(cool) = 10 + 0.9 * (0.25 * J*(cool) + 0.75 * 47.06)
J*(cool) = 53.89
Lastly (fast, slow):
This is the hardest case, but we have two equations in two unknowns, so we can solve them simultaneously.
J*(cool) = 10 + 0.9 * (0.25 * J*(cool) + 0.75 * J*(warm))
J*(warm) = 4 + 0.9 * (0.5 * J*(cool) + 0.5 * J*(warm))
J*(warm) = (4 + 0.45 * J*(cool))/0.55
J*(cool) = 10 + 0.9 * (0.25 * J*(cool) + 0.75 * (4 + 0.45 * J*(cool))/0.55)
J*(cool) = 66.94
J*(warm) = 62.04
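If you would rather not do the algebra by hand, here is a minimal Python sketch (the rewards and transition probabilities are the ones assumed in the equations above) that evaluates each policy by solving the 2-by-2 linear system J = r + 0.9 * P * J directly; it reproduces the four sets of numbers:

```python
import numpy as np

gamma = 0.9

# (reward, [P(next=cool), P(next=warm)]) for each (state, action); any remaining
# probability mass (the 0.125 for fast in warm) goes to the terminal "off" state, value 0.
model = {
    ("cool", "slow"): (4,  [1.0,  0.0]),
    ("cool", "fast"): (10, [0.25, 0.75]),
    ("warm", "slow"): (4,  [0.5,  0.5]),
    ("warm", "fast"): (10, [0.0,  0.875]),
}

def evaluate(policy):
    """Solve (I - gamma * P) J = r for a fixed policy = (action in cool, action in warm)."""
    r = np.array([model[("cool", policy[0])][0], model[("warm", policy[1])][0]], dtype=float)
    P = np.array([model[("cool", policy[0])][1], model[("warm", policy[1])][1]], dtype=float)
    return np.linalg.solve(np.eye(2) - gamma * P, r)

for policy in [("slow", "slow"), ("slow", "fast"), ("fast", "fast"), ("fast", "slow")]:
    J_cool, J_warm = evaluate(policy)
    print(policy, round(J_cool, 2), round(J_warm, 2))
```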
As we can see, the highest value that can be obtained starting in the warm state is 62.04, and the highest value starting in cool is 66.94. Both occur under the policy (fast, slow), i.e. fast in cool and slow in warm, so this is the optimal policy.
As it turns out, it is not possible for a policy to be optimal if you start in one state but not optimal if you start in another. It is also worth noting that for discounted infinite-horizon MDPs like this one, you can prove that the optimal policy is always stationary, that is, if it is optimal to take the slow action in the cool state at time 1, it is optimal to take the slow action there at every time step.
Finally, in practice the number of states and actions is much larger than in this question, so enumerating every policy is not feasible, and more advanced techniques such as value iteration or policy iteration (both forms of dynamic programming) are typically required.
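To give a flavour of what that looks like, here is a minimal value iteration sketch in Python for this same MDP (again using the transition probabilities assumed above). It repeatedly applies the Bellman optimality update J(s) <- max over a of [r(s, a) + 0.9 * sum over s' of P(s' | s, a) * J(s')] and then reads off the greedy policy, which comes out as fast in cool and slow in warm:

```python
gamma = 0.9
states = ["cool", "warm"]
acts = ["slow", "fast"]

# (reward, [P(next=cool), P(next=warm)]) for each (state, action); remaining
# probability mass goes to the terminal "off" state, which is worth 0.
model = {
    ("cool", "slow"): (4,  [1.0,  0.0]),
    ("cool", "fast"): (10, [0.25, 0.75]),
    ("warm", "slow"): (4,  [0.5,  0.5]),
    ("warm", "fast"): (10, [0.0,  0.875]),
}

def q(s, a, J):
    """One-step lookahead value of taking action a in state s, given value estimate J."""
    r, probs = model[(s, a)]
    return r + gamma * sum(p * J[s2] for p, s2 in zip(probs, states))

# Value iteration: apply the Bellman optimality update until the values stop changing.
J = {s: 0.0 for s in states}
while True:
    J_new = {s: max(q(s, a, J) for a in acts) for s in states}
    delta = max(abs(J_new[s] - J[s]) for s in states)
    J = J_new
    if delta < 1e-9:
        break

# Greedy policy with respect to the converged values.
policy = {s: max(acts, key=lambda a: q(s, a, J)) for s in states}
print(J)       # roughly {'cool': 66.94, 'warm': 62.04}
print(policy)  # {'cool': 'fast', 'warm': 'slow'}
```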