
I understand how belief states are updated in a POMDP. But in the Policy and Value Function section of http://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process I could not figure out how to calculate the value V*(T(b,a,o)) needed for finding the optimal value function V*(b). I have read a lot of resources on the internet, but none of them explain this calculation clearly. Can someone provide a mathematically worked example with all the calculations, or a mathematically clear explanation?

Bugs Bunny

2 Answers


You should check out this tutorial on POMDPs:

http://cs.brown.edu/research/ai/pomdp/tutorial/index.html

It includes a section about Value Iteration, which can be used to find an optimal policy/value function.

ziggystar
  • The link provides a good explanation of how value iteration works, but it does not give enough detail on how the value itself is calculated. It assumes a value has already been computed for a given action and observation, and does not explain how we get that value for that belief state. – Bugs Bunny Oct 25 '14 at 22:06

I will use the same notation in this answer as Wikipedia. First I repeat the value function as stated there:

V*(b) = max_a [ r(b,a) + γ Σ_o O(o|b,a) V*(τ(b,a,o)) ]

V*(b) is the optimal value function with the belief b as parameter. The belief b assigns a probability to every state s, and these probabilities sum to 1:

Σ_s b(s) = 1

r(b,a) is the expected reward for belief b and action a. It is calculated by weighting the original reward function R(s,a) (the reward for being in state s and taking action a) by the belief over each state:

r(b,a) = Σ_s b(s) R(s,a)
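As a minimal sketch of this weighting (the 2-state reward matrix below is a hypothetical example, not part of the original question), r(b,a) is just the dot product of the belief vector with the column of R for action a:

    import numpy as np

    # Hypothetical toy model: 2 states, 2 actions (indices are illustrative only).
    R = np.array([[-1.0, 10.0],    # R[s, a]: reward for taking action a in state s
                  [-1.0, -5.0]])

    def belief_reward(b, a, R):
        """r(b, a) = sum_s b(s) * R(s, a): expected reward of action a under belief b."""
        return float(np.dot(b, R[:, a]))

    b = np.array([0.7, 0.3])         # belief over the 2 states, sums to 1
    print(belief_reward(b, 0, R))    # 0.7*(-1) + 0.3*(-1) = -1.0
    print(belief_reward(b, 1, R))    # 0.7*10  + 0.3*(-5)  =  5.5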

The observation function O, originally defined over states as O(o|s',a), can also be written in terms of a belief b:

O(o|b,a) = Σ_s' O(o|s',a) Σ_s T(s'|s,a) b(s)

This is the probability of receiving observation o given belief b and action a. Note that O and T are the observation and transition probability functions given as part of the POMDP model.
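A small sketch of this formula, assuming hypothetical transition and observation tensors indexed as T[a, s, s'] and O[a, s', o] (this indexing convention and the numbers are my assumptions, not from the post):

    import numpy as np

    # Hypothetical toy model, 2 states / 2 actions / 2 observations:
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],    # T[a, s, s'] = T(s' | s, a)
                  [[0.5, 0.5], [0.5, 0.5]]])
    O = np.array([[[0.8, 0.2], [0.3, 0.7]],    # O[a, s', o] = O(o | s', a)
                  [[0.5, 0.5], [0.5, 0.5]]])

    def observation_prob(b, a, o, T, O):
        """O(o | b, a) = sum_{s'} O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
        predicted = b @ T[a]              # predicted[s'] = sum_s b(s) * T(s' | s, a)
        return float(predicted @ O[a][:, o])

    b = np.array([0.7, 0.3])
    print(observation_prob(b, 0, 0, T, O))    # 0.69*0.8 + 0.31*0.3 = 0.645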

Finally, the function τ(b,a,o) gives the new belief state b' = τ(b,a,o) resulting from the previous belief b, action a and observation o. For each state s' we can calculate the new probability:

b'(s') = τ(b,a,o)(s') = ( O(o|s',a) Σ_s T(s'|s,a) b(s) ) / O(o|b,a)
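A sketch of this belief update under the same assumed toy model as above; note that the normalising constant is exactly O(o|b,a), so the function returns it as well:

    import numpy as np

    # Same hypothetical toy tensors as in the previous sketch (assumed, not from the post):
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],    # T[a, s, s'] = T(s' | s, a)
                  [[0.5, 0.5], [0.5, 0.5]]])
    O = np.array([[[0.8, 0.2], [0.3, 0.7]],    # O[a, s', o] = O(o | s', a)
                  [[0.5, 0.5], [0.5, 0.5]]])

    def belief_update(b, a, o, T, O):
        """tau(b, a, o): b'(s') = O(o | s', a) * sum_s T(s' | s, a) * b(s), normalised.
        The normaliser equals O(o | b, a), so it is returned as a by-product."""
        unnormalised = O[a][:, o] * (b @ T[a])   # one entry per next state s'
        prob_o = unnormalised.sum()              # this is exactly O(o | b, a)
        return unnormalised / prob_o, prob_o

    b = np.array([0.7, 0.3])
    b_next, prob_o = belief_update(b, a=0, o=0, T=T, O=O)
    print(b_next, prob_o)    # [0.8558..., 0.1441...], 0.645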

The new belief b' = τ(b,a,o) can then be plugged back into the value function, which gives the recursive term V*(τ(b,a,o)).

The optimal value function can be approximated using, for example, value iteration, which applies dynamic programming: the value function is updated iteratively until the difference between two iterations is smaller than a small value ε.
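As a rough sketch of such an iteration (a grid-based approximation over beliefs of a hypothetical 2-state, 2-action, 2-observation model; this is not the exact alpha-vector algorithm and not part of the original answer), each backup evaluates r(b,a), O(o|b,a), τ(b,a,o) and the current value estimate at the updated belief:

    import numpy as np

    gamma, epsilon = 0.95, 1e-6
    R = np.array([[-1.0, 10.0],                  # R[s, a]: reward for action a in state s
                  [-1.0, -5.0]])
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],      # T[a, s, s'] = T(s' | s, a)
                  [[0.5, 0.5], [0.5, 0.5]]])
    O = np.array([[[0.8, 0.2], [0.3, 0.7]],      # O[a, s', o] = O(o | s', a)
                  [[0.5, 0.5], [0.5, 0.5]]])

    grid = np.linspace(0.0, 1.0, 101)            # beliefs b = [p, 1 - p] for p on a grid
    V = np.zeros(len(grid))                      # current value estimate per grid point

    def value(p):
        """Nearest-neighbour lookup of the current V at belief [p, 1 - p]."""
        return V[int(round(p * (len(grid) - 1)))]

    while True:
        V_new = np.empty_like(V)
        for i, p in enumerate(grid):
            b = np.array([p, 1.0 - p])
            q_best = -np.inf
            for a in range(R.shape[1]):
                q = float(b @ R[:, a])                     # r(b, a)
                predicted = b @ T[a]                       # sum_s T(s' | s, a) b(s)
                for o in range(O.shape[2]):
                    unnorm = O[a][:, o] * predicted
                    prob_o = unnorm.sum()                  # O(o | b, a)
                    if prob_o > 0.0:
                        b_next = unnorm / prob_o           # tau(b, a, o)
                        q += gamma * prob_o * value(b_next[0])
                q_best = max(q_best, q)
            V_new[i] = q_best
        diff = np.max(np.abs(V_new - V))
        V = V_new
        if diff < epsilon:
            break

    print(value(0.5))    # approximate V*([0.5, 0.5])

The nearest-neighbour lookup is the crudest possible interpolation over the belief grid; exact POMDP solvers represent V* as a set of alpha-vectors instead, but the backup per belief is the same formula.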

There is a lot more information on POMDPs available, for example in the tutorial linked in the other answer.

agold