
While learning about MDPs, I am having trouble with value iteration. Conceptually, this example is very simple and makes sense:

If you have a 6-sided die and you roll a 4, 5, or 6, you keep that amount in dollars, but if you roll a 1, 2, or 3, you lose your bankroll and the game ends.

In the beginning you have $0 so the choice between rolling and not rolling is:

k = 1
If I roll : 1/6*0 + 1/6*0 + 1/6*0 + 1/6*4 + 1/6*5 + 1/6*6 = 2.5 
If I don't roll : 0
since 2.5 > 0 I should roll
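That first step can be checked in a few lines of Python (a quick sketch of the k = 1 expectation, reusing the `isBadSide` convention from the code below):

```python
# Expected value of rolling once from a $0 bankroll (k = 1):
# bad sides (1-3) forfeit the bankroll, which is still $0 here,
# and good sides (4-6) pay their face value.
isBadSide = [1, 1, 1, 0, 0, 0]

ev_roll = sum(0 if isBadSide[side - 1] else side
              for side in range(1, 7)) / 6.0
print(ev_roll)  # 2.5, so rolling beats not rolling (0)
```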

k = 2:
If I roll and get a 4:
    If I roll again: 4 + 1/6*(-4) + 1/6*(-4) + 1/6*(-4) + 1/6*4 + 1/6*5 + 1/6*6 = 4.5
    If I don't roll: 4
    since 4.5 is greater than 4 I should roll

If I roll and get a 5:
    If I roll again: 5 + 1/6*(-5) + 1/6*(-5) + 1/6*(-5) + 1/6*4 + 1/6*5 + 1/6*6 = 5
    If I don't roll: 5
    Since the difference is 0 I should not roll

If I roll and get a 6:
    If I roll again: 6 + 1/6*(-6) + 1/6*(-5) + 1/6*(-5) + 1/6*4 + 1/6*5 + 1/6*6 = 5.5
    If I don't roll: 6
    Since the difference is -0.5 I should not roll
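The three k = 2 cases all follow the same one-step lookahead, so they can be generalized into a small helper (a sketch; `roll_value` is my own name, not part of the Berkeley code):

```python
def roll_value(balance):
    """Expected total if you roll once more from `balance`:
    sides 1-3 forfeit the whole bankroll, sides 4-6 add their face value."""
    lose = 3 * (1.0 / 6) * (-balance)            # rolling a 1, 2, or 3
    win = sum((1.0 / 6) * side for side in (4, 5, 6))
    return balance + lose + win

print(roll_value(4))  # 4.5 > 4 -> roll
print(roll_value(5))  # 5.0 = 5 -> indifferent
print(roll_value(6))  # 5.5 < 6 -> don't roll
```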

What I am having trouble with is converting that into Python code; not because I am bad at Python, but because my understanding of the pseudocode may be wrong, even though the Bellman equation makes sense to me.

I borrowed the Berkeley code for value iteration and modified it to:

isBadSide = [1,1,1,0,0,0]

def R(s):
    if isBadSide[s-1]:
        return -s
    return s

def T(s, a, N):
    return [(1./N, s)]

def value_iteration(N, epsilon=0.001):
    "Solving an MDP by value iteration. [Fig. 17.4]"
    U1 = dict([(s, 0) for s in range(1, N+1)])
    while True:
        U = U1.copy()
        delta = 0
        for s in range(1, N+1):
            U1[s] = R(s) + max([sum([p * U[s1] for (p, s1) in T(s, a, N)])
                                        for a in ('s', 'g',)])

            delta = max(delta, abs(U1[s] - U[s]))

        if delta < epsilon:
            return U

print(value_iteration(6))
# {1: -1.1998456790123457, 2: -2.3996913580246915, 3: -3.599537037037037, 4: 4.799382716049383, 5: 5.999228395061729, 6: 7.199074074074074}

Which gives the wrong answer. Where is the bug in this code, or is it an issue with my understanding of the algorithm?

Sam Hammamy
  • A couple of questions. 1) If I roll `5; 5; 5; 1`, will the reward be `10` or `0`? 2) Since once I roll `1`, the game is over, the transition probabilities are not all equal, right? `P(1, 6) = P(1, 1) = 0`. – Anton Aug 27 '17 at 22:20
  • I see your points. The way I think of it is: if I roll a `1`, I lose the money, so the reward is `-10`, right? And `P(1,1)` is `1/6`. The probability of landing on any number is `1/6`, right? – Sam Hammamy Aug 28 '17 at 00:30
  • I see what you're saying about `P(1,1)`. Once you land on a `1`, the game is over, so there is no more transition probability. – Sam Hammamy Aug 28 '17 at 00:31
  • But that means the reward depends on all the previous states. And if the reward is not a function of the current state, the action, and the next state, then it's not really a Markov Decision Process, is it? – Anton Aug 28 '17 at 00:41
  • Another good point! – Sam Hammamy Aug 28 '17 at 00:43
  • Just to be clear: I am approaching this correctly? Or should I be using another representation for the states? – Sam Hammamy Aug 28 '17 at 01:53

1 Answer


Let B be your current balance.

If you choose to roll, the expected reward is 2.5 - B * 0.5.

If you choose not to roll, the expected reward is 0.

So, the policy is this: If B < 5, roll. Otherwise, don't.

And the expected reward on each step when following that policy is V = max(0, 2.5 - B * 0.5).
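That closed form can be checked numerically (nothing here beyond the formulas above; `expected_roll_gain` is an illustrative name of mine):

```python
def expected_roll_gain(B):
    # Prob 1/2: forfeit the balance B; prob 1/6 each: gain 4, 5, or 6.
    return 0.5 * (-B) + (4 + 5 + 6) / 6.0

for B in range(7):
    V = max(0.0, expected_roll_gain(B))
    print(B, V)  # V reaches 0 at B = 5, where the policy switches to "don't roll"
```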


Now, if you want to express it in terms of the Bellman equation, you need to incorporate the balance into the state.

Let the state <Balance, GameIsOver> consist of the current balance and the flag that defines whether the game is over.

  • Action stop:
    • turns the state <B, false> into <B, true>
  • Action roll:
    • turns <B, false> into <0, true> with the probability 1/2
    • turns <B, false> into <B + 4, false> with the probability 1/6
    • turns <B, false> into <B + 5, false> with the probability 1/6
    • turns <B, false> into <B + 6, false> with the probability 1/6
  • No action can turn <B1, true> into <B2, false>

Using the notation from here:

π(<B, false>) = "roll", if B < 5

π(<B, false>) = "stop", if B >= 5

V(<B, false>) = 2.5 - B * 0.5, if B < 5

V(<B, false>) = 0, if B >= 5
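Putting the pieces together, here is a minimal value-iteration sketch over these states (my own illustration, not the Berkeley code). It assumes a winning roll's reward is the face value gained and a losing roll's reward is `-B`. Since you never roll at `B >= 5` and the largest winning side is 6, balances above 10 are unreachable under the optimal policy, so the state space is capped at `B = 10`; out-of-range lookups default to 0, consistent with `V(<B, false>) = 0` for `B >= 5`.

```python
MAX_B = 10

def value_iteration(epsilon=1e-9):
    # V[b] is the value of <b, false>; every <b, true> state is
    # terminal with value 0, so it needs no entry.
    V = {b: 0.0 for b in range(MAX_B + 1)}
    while True:
        delta = 0.0
        for b in range(MAX_B + 1):
            q_stop = 0.0  # stopping ends the game: no further reward
            # rolling: prob 1/2 forfeit b, prob 1/6 each gain 4, 5, or 6
            q_roll = 0.5 * (-b) + sum(
                (1.0 / 6) * (k + V.get(b + k, 0.0)) for k in (4, 5, 6))
            new_v = max(q_stop, q_roll)
            delta = max(delta, abs(new_v - V[b]))
            V[b] = new_v
        if delta < epsilon:
            return V

V = value_iteration()
print(V[0])  # ~2.583: total expected winnings starting from $0
print(V[4])  # 0.5 -> still worth rolling
print(V[5])  # 0.0 -> stop
```

Note that `V[0]` comes out slightly above 2.5, because the total return from $0 includes the value of the states you can reach (e.g. rolling again from a balance of 4), not just the immediate expected gain of one roll.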

Anton
  • This looks like you worked it out on paper then decided how to represent the states. What if N is `21` or `42` instead of `6`? – Sam Hammamy Aug 28 '17 at 12:52
  • I get the balance has to be part of the state. But I don't see how game is over should be part of the state? I won't know that in advance when writing the value iteration? – Sam Hammamy Aug 28 '17 at 12:53
  • @SamHammamy If `N = 42`, there will be up to `43` different states the game can transition to at every iteration. You don't know in advance when the game will be over, but you can iterate over all outcomes, including those that stop the game. – Anton Aug 28 '17 at 13:12
  • @SamHammamy You can't apply the value iteration algorithm as is, because the number of all possible states is infinite. Reducing them to a finite number of "meaningful" states is what needs to be worked out on paper. For example, in this case, the only states you care about are `<0, false>`, `<1, false>`, `<2, false>`, `<3, false>`, `<4, false>`, `<5, false>`, `<6, false>`, `<7, false>`, `<8, false>`, `<9, false>`, `<10, false>`, `<0, true>`, `<1, true>`, `<2, true>`, `<3, true>`, `<4, true>`, `<5, true>`, `<6, true>`, `<7, true>`, `<8, true>`, `<9, true>`, `<10, true>`. – Anton Aug 28 '17 at 13:13
  • @SamHammamy Since you never roll when `B >= 5`, there is no way to reach a state with `B >= 11` (if you follow the optimal policy). – Anton Aug 28 '17 at 13:16
  • I see what you are saying. I can start with `<0, false> * 3` and `<0, true> * 3` for the N = 6 problem, and `<0, false> * 30` plus `<0, true> * 12` if it happened to be that with `N=42` there are `12` sides that terminate the game. I will re-write the code above with this later today and report back here. – Sam Hammamy Aug 28 '17 at 13:23
  • @SamHammamy were you able to figure this out? I'm having trouble conceptualizing the solution. – asing May 20 '18 at 18:49
  • @asing No, unfortunately I never did! – Sam Hammamy May 23 '18 at 13:49