1

I am trying to learn about Markov decision problems and was given the algorithm for Value Iteration, but I am confused about how to turn it into actual C++ code, mainly the parts where summations and such occur. Here is the algorithm:

function VALUE-ITERATION(P;R) returns a utility matrix
    inputs: P, a transition-probability matrix
            R, a reward matrix
    local variables: U, utility matrix, initially identical to R
                     U', utility matrix, initially identical to R
    repeat
         U <- U'
         for each state i do
             U'(s_i) <-  R(s_i) + max_a Summation_j P^a_ij*U(s_j)
         end
    until max_(s_i) |U(s_i) - U'(s_i)| < e
return U

This looks like hieroglyphics to me, is there a simpler algorithm that would be of more help to me? Or could somebody dumb it down for me?

David Hammen
James Brown
  • if you are not able to follow this kind of pseudo code, then you will have a hard time with _any_ other algorithm you might face, no matter which language. – akira Dec 04 '12 at 20:06
  • What is `max_a Summation_j P^a_ij*U(s_j)`? That makes no sense to me. Has some formatting been lost? What is `j`? What is `e`? – Mooing Duck Dec 04 '12 at 20:06
  • @MooingDuck - `max_a Summation_j P^a_ij*U(s_j)`: `P^a_ij` is the probability that action `a` will take the system from state `s_i` to state `s_j`. `P^a_ij*U(s_j)` is the probability-weighted benefit of transitioning from state `s_i` to state `s_j` via action `a`. Summing `P^a_ij*U(s_j)` over all state indices `j` gives the overall probability-weighted benefit of action `a`. Taking the maximum over all actions picks the action that is most likely to do the best good to the system. – David Hammen Dec 04 '12 at 20:15
  • Why all the downvotes and votes to close? – David Hammen Dec 04 '12 at 20:39
  • `max_(s_i) | U(s_i) - U'(s_i)j < e`: That's a typo. It should be `max_(s_i) | U(s_i) - U'(s_i) | < e`. – David Hammen Dec 04 '12 at 20:51
  • Probably downvotes due to my poor formatting skills. I really appreciate all your help @David, and I appreciate your help as well MooingDuck. Thanks for sticking with the question and I really found both of your answers to be very helpful! – James Brown Dec 04 '12 at 21:19
  • Probably because your pseudocode looks like hieroglyphics. Well, it does. Welcome to the world of AI, especially the more 'mathy' parts of AI. To those who voted to close: *This question is not too localized.* Markov decision problems are a central concept in AI. – David Hammen Dec 04 '12 at 21:30

2 Answers

3

I found this article quite readily: Value iteration and policy iteration algorithms for Markov decision problem [PDF file]. It explains quite a bit more what's going on.

Conceptually, you have a system that can be in a number of states, rewards for transitions from one state to another, and actions that sometimes can result in state transitions. The basic idea is to keep iterating until you have arrived at a utility matrix that doesn't change. That's what that final test max_(s_i) |U(s_i) - U'(s_i)| < e looks for. (Here, e is short for epsilon, a small number, and probably should be an additional input.)
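To make that stopping test concrete, here is a minimal C++ sketch. It assumes the utilities are stored as one `double` per state in a `std::vector<double>`; the names `max_abs_change` and `has_converged` are made up for illustration and do not come from the cited article.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute per-state change between two utility vectors,
// i.e. max_(s_i) |U(s_i) - U'(s_i)|. Assumes both have the same size.
double max_abs_change(const std::vector<double>& U,
                      const std::vector<double>& U_next)
{
    double largest = 0.0;
    for (std::size_t i = 0; i < U.size(); ++i)
        largest = std::max(largest, std::abs(U[i] - U_next[i]));
    return largest;
}

// True once no state's utility moved by epsilon or more.
bool has_converged(const std::vector<double>& U,
                   const std::vector<double>& U_next,
                   double epsilon)
{
    return max_abs_change(U, U_next) < epsilon;
}
```

The `|...|` in the pseudocode is just `std::abs`, and `max_(s_i)` is a running maximum over all states.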

For each iteration, you want to take the best action for each state. The best action is the one that yields the greatest reward, weighted by probability. That's what max_a Summation_j P^a_ij*U(s_j) does: Find the action that yields the best reward, weighted by probability.
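In code, the summation is an inner loop that accumulates into a variable, and the max is an outer loop that keeps the largest sum seen so far. The sketch below assumes a storage layout of my own choosing, `P[a][i][j]`, for the probability that action `a` takes state `i` to state `j`; the alias `Transition` and the function name `best_expected_utility` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Assumed layout: P[a][i][j] is the probability that action a
// moves the system from state i to state j.
using Transition = std::vector<std::vector<std::vector<double>>>;

// Computes max_a Summation_j P^a_ij * U(s_j) for one state i.
// Returns -infinity if P contains no actions at all.
double best_expected_utility(const Transition& P,
                             const std::vector<double>& U,
                             std::size_t i)
{
    double best = -std::numeric_limits<double>::infinity();
    for (std::size_t a = 0; a < P.size(); ++a) {      // max over actions a
        double expected = 0.0;
        for (std::size_t j = 0; j < U.size(); ++j)    // Summation over j
            expected += P[a][i][j] * U[j];
        best = std::max(best, expected);
    }
    return best;
}
```

For example, with U = (2, 4, 1), an action that leads to states 0 and 1 with probability 0.5 each scores 0.5*2 + 0.5*4 = 3, while an action that always leads to state 2 scores 1, so the function returns 3.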

David Hammen
  • 1
    An aside, mostly to James Brown: Learn how to use and abuse search engines. I googled *function VALUE-ITERATION(P;R) returns a utility matrix* and voila, the cited article was right there at the top of the search results. If this search hadn't worked, I would have refined my search terms until I eventually did find something that properly described that algorithm. – David Hammen Dec 04 '12 at 22:36
2

I can translate bits and pieces, but there's a lot of information in your code that only makes sense in context, and there's no way for us to know that context.
Also, it appears that some formatting was lost along the way, since P^a_ij looks like it was at one point P to the power of a_i times j. David seems to know how to interpret the crazy bit.
Also the condition loop uses | in the pseudo-code which is weird, but I took it literally.

utility_matrix VALUE_ITERATION(const probability_matrix& P,
                               const reward_matrix& R)
{
    utility_matrix U(R);
    utility_matrix UP(R);
    do {
        U = UP;
        for(int s_i : ????) //for each state in what?
            UP[s_i] = R[s_i] + ???? //max_a Summation_j P^a_ij*U(s_j)
    } while(!(max(s_i) ???? std::abs(U[s_i] - UP[s_i]) < e)); //"until X" means "while not X"
    return U;
}
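For what it's worth, here is one way the `????` placeholders could be filled in. Treat it as a sketch under assumptions that are not in the question: utilities and rewards are plain `std::vector<double>` with one entry per state, `P[a][i][j]` is the probability that action `a` takes state `i` to state `j`, there is at least one action, and `e` is passed in as a parameter. The pseudocode has no discount factor, so the loop is not guaranteed to terminate for every `P` and `R`; the `max_sweeps` cap is my own addition to guard against that.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Assumed layout: P[a][i][j] = probability that action a moves state i to state j.
using Transition = std::vector<std::vector<std::vector<double>>>;

std::vector<double> value_iteration(const Transition& P,
                                    const std::vector<double>& R,
                                    double e,
                                    int max_sweeps = 10000)  // safety cap, not in the pseudocode
{
    const std::size_t n = R.size();
    std::vector<double> U(R), UP(R);  // U and U', both initially identical to R
    double change = 0.0;
    do {
        U = UP;
        change = 0.0;
        for (std::size_t i = 0; i < n; ++i) {            // for each state i
            double best = -std::numeric_limits<double>::infinity();
            for (const auto& Pa : P) {                   // max over actions a
                double sum = 0.0;
                for (std::size_t j = 0; j < n; ++j)      // Summation over j
                    sum += Pa[i][j] * U[j];
                best = std::max(best, sum);
            }
            UP[i] = R[i] + best;
            change = std::max(change, std::abs(U[i] - UP[i]));
        }
    } while (change >= e && --max_sweeps > 0);           // until max |U - U'| < e
    return U;
}
```

A call would look like `value_iteration(P, R, 1e-6)`. A state whose rows in `P` are all zero behaves as a terminal state here: its utility stays equal to its reward.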

As akira said, the understandable bits are straightforward; if you couldn't do those, you might need to learn more about C++ before you tackle this.

As per your comment, I found C code that looks vaguely like your algorithm here. (Lines 62-76)

Mooing Duck
  • I understand the loops and nearly everything but the actual summations, I don't know how to format the summations correctly on here, but to clarify the P^a with subscript ij, the algorithm was found in Artificial Intelligence A Modern Approach Figure 17.4. I've tried finding an image or link to it to make my actual question more useful. – James Brown Dec 04 '12 at 20:19
  • Re *`P^a_ij` looks like it was at one point P to the power of `a_i` times `j`*: That's wrong. The `a` is just a superscript and is distinguished from the subscripts `i` and `j` because `a` is an action while `i` and `j` are state indices. Think of `P` as a three dimensional matrix. – David Hammen Dec 04 '12 at 20:36
  • that bit of java code you found was veryyy helpful! Thanks! – James Brown Dec 04 '12 at 20:45
  • Re *Also the condition loop uses | in the pseudo-code which is weird, but I took it literally.* That was a typo in the question. It should have been `max_(s_i) |U(s_i) - U'(s_i)| < e`. Here the `|xxx|` means absolute value, not some conditional logic. – David Hammen Dec 04 '12 at 20:53
  • +1 for a good try at making sense of an apparently inscrutable, hieroglyphic AI algorithm. – David Hammen Dec 04 '12 at 22:28
  • @DavidHammen: Although I appreciate the sentiment, I don't think "good try" should be getting an upvote on SO. Kinda defeats the point. – Mooing Duck Dec 04 '12 at 22:32
  • You gave an outline for an algorithm with which you had no familiarity. Knowing how to ferret out what one can understand, leaving the rest as `???` up to someone else is an important ability. Knowing how to teach someone else to do that is an even more important ability. – David Hammen Dec 04 '12 at 22:44