
I have a task where I have to calculate the optimal policy (Reinforcement Learning - Markov decision process) in a grid world (the agent moves left, right, up, down).

In the left table are the optimal values (V*). In the right table is the solution (directions), which I don't know how to get using that "optimal policy" formula. γ = 0.9 (discount factor)

[Figure: the grid world, with the optimal state values V* in the left table and the optimal policy arrows in the right table]

And here is the formula:

π*(s) = argmax_a [ r(s,a) + γ · Σ_{s'} p(s,a,s') · V*(s') ]

So if anyone knows how to use that formula to get the solution (those arrows), please help.

Edit: the whole problem description is on this page: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node35.html Rewards: state A (column 2, row 1) is followed by a reward of +10 and a transition to state A', while state B (column 4, row 1) is followed by a reward of +5 and a transition to state B'. You can move up, down, left, or right. You cannot move outside the grid or stay in the same place.

OldFox

1 Answer


Break the math down, piece by piece:

The arg max (....) is telling you to find the argument, a, which maximizes everything in the parentheses. The variables a, s, and s' are an action, a state you're in, and a state that results from that action, respectively. So the arg max (...) is telling you to find an action that maximizes that term.

You know gamma, and someone did the hard work of calculating V*(s'), which is the value of that resulting state. So you know what to plug in there, right?

So what is p(s,a,s')? That is the probability that, starting from s and doing a, you end in some s'. This is meant to represent some kind of faulty actuator: you say "go forward!" and it foolishly decides to go left (or two squares forward, or remain still, or whatever). For this problem, I'd expect it to be given to you, but you haven't shared it with us. And the summation over s' is telling you that when you start in s and you pick an action a, you need to sum over all possible resulting s' states. Again, you need the details of that p(s,a,s') function to know what those are.

And last, there is r(s,a), which is the reward for doing action a in state s, regardless of where you end up. In this problem it might be slightly negative, to represent a fuel cost. If there were a series of rewards in the grid and a grab action, it might be positive. You need that, too.
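If it helps to see that term in code, here is a minimal sketch of the bracketed expression r(s,a) + γ · Σ_{s'} p(s,a,s') · V*(s') for a single state-action pair, assuming the model is handed to you as plain dictionaries (the names p, r, V and the dictionary layout are illustrative, not something given in the book):

```python
GAMMA = 0.9

# Placeholder layout (illustrative, not the book's actual numbers):
#   p[(s, a)] -> {s_next: probability}   the transition function p(s, a, s')
#   r[(s, a)] -> number                  the reward r(s, a)
#   V[s]      -> number                  the optimal value V*(s)

def action_value(s, a, p, r, V, gamma=GAMMA):
    """Evaluate r(s,a) + gamma * sum over s' of p(s,a,s') * V*(s')."""
    expected_next = sum(prob * V[s_next] for s_next, prob in p[(s, a)].items())
    return r[(s, a)] + gamma * expected_next
```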

So, what to do? Pick a state s, and calculate your policy for it. For each s, you're going to have the possibility of (s,a1), and (s,a2), and (s,a3), etc. You have to find the a that gives you the biggest result. And of course for each pair (s,a) you may (in fact, almost certainly will) have multiple values of s' to stick in the summation.

If this sounds like a lot of work, well, that's why we have computers.
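If you do hand it to a computer, a minimal sketch of the whole recipe might look like this (again, the model dictionaries and the actions mapping are assumptions about how you've encoded the grid, not the book's actual data):

```python
GAMMA = 0.9

def action_value(s, a, p, r, V, gamma=GAMMA):
    """r(s,a) + gamma * sum over s' of p(s,a,s') * V*(s')."""
    return r[(s, a)] + gamma * sum(prob * V[s2] for s2, prob in p[(s, a)].items())

def greedy_policy(states, actions, p, r, V, gamma=GAMMA):
    """For every state, keep the action with the largest action_value (the arg max)."""
    return {s: max(actions[s], key=lambda a: action_value(s, a, p, r, V, gamma))
            for s in states}
```

Here states is an iterable of states and actions[s] is the list of legal moves from s. Ties (squares with two arrows in the book's figure) would need a small tweak to keep every maximizing action rather than just one.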

PS - Read the problem description very carefully to find out what happens if you run into a wall.

Novak
  • Thanks a lot for your answer. I've added the problem description (rewards, and a link to the full description). Could you show me how to use that formula for just one field? – OldFox Feb 01 '12 at 06:22
  • Ok, so let's take the field in row 2, column 4 (value 17.8). You get a reward only on the fields row 1, col 2 (reward 10) and row 1, col 4 (reward 5). (Probability of movement is the same for all directions.) gamma * ( p(Left)*V*(Left) + p(Up)*V*(Up) + p(Right)*V*(Right) + p(Down)*V*(Down) ) = 0.9 * (1*19.8 + 1*19.4 + 1*16 + 1*16) = 64.8 <- those are the values around our starting point. Should we first move left, and then sum the values around the new point (s')? (How do we know that we should go left from here?) – OldFox Feb 02 '12 at 07:07
  • No, no, no. There is no "probability of action" in this problem. There is "probability of state as a result of action." Figure 3.5b and the supporting text talk about "probability of action", but that leads to a different value function, shown in 3.5b. Figure 3.8b shows the value function when you are choosing actions. So in the square of interest, you choose an action (say, "left") and plug in all the values, and write the answer down. Then choose another action (say, "up") and do it again. Etc., for all possible actions. The biggest result is your preferred action. – Novak Feb 02 '12 at 14:39
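To make that last comment concrete, here are the numbers for that square (row 2, column 4, where V* = 17.8), using the neighbour values quoted in the comments and assuming deterministic moves with a reward of 0 for an ordinary move (the book's setup away from A, B, and the walls):

```python
GAMMA = 0.9
# V* of the four neighbouring squares, as quoted in the comment above.
neighbour_value = {"left": 19.8, "up": 19.4, "right": 16.0, "down": 16.0}

# Deterministic moves and reward 0 from this square, so the bracketed term
# collapses to r + gamma * V*(s') = 0 + 0.9 * V*(neighbour).
q = {a: round(0 + GAMMA * v, 2) for a, v in neighbour_value.items()}
print(q)                   # {'left': 17.82, 'up': 17.46, 'right': 14.4, 'down': 14.4}
print(max(q, key=q.get))   # left
```

The arg max is "left", and 0.9 · 19.8 = 17.82 ≈ 17.8, which matches the value printed in that square, so the arrow and the V* table agree.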