I'm having trouble understanding the Monte Carlo policy evaluation algorithm. What I am reading is that $G$ is the average return after visiting a particular state, let's say $s_1$, for the first time. Does this mean averaging all rewards following that state $s_1$ to the end of the episode and then assigning the resulting value to $s_1$? Or does it mean the immediate reward received for taking an action in $s_1$, averaged over multiple episodes?

1 Answer
The purpose of Monte Carlo policy evaluation is to find a value function for a given policy $\pi$. A value function for a policy simply tells us the expected cumulative discounted reward that results from being in a state and then following the policy forever, or until the end of the episode. In other words, it tells us the expected return for each state.
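Written out (using the standard Sutton & Barto notation, which may differ slightly from your source's), the return from time step $t$ and the value function are:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

$$v_\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right]$$

where $\gamma \in [0, 1]$ is the discount factor.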
So a Monte Carlo approach to estimating this value function is to simply run the policy and keep track of the return from each state: when I reach a state for the first time, how much discounted reward do I accumulate in the rest of the episode? Average all of the returns that you observe (one return per state that you visit, per episode that you run), as in the sketch below.
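Here is a minimal sketch of first-visit Monte Carlo policy evaluation in Python. Note that `env` and `policy` are hypothetical stand-ins (an episodic environment whose `reset()` returns a state and whose `step(action)` returns `(next_state, reward, done)`, plus a function mapping states to actions); the return-averaging logic is the part that matters:

```python
from collections import defaultdict

def first_visit_mc_evaluation(env, policy, num_episodes, gamma=0.9):
    """Estimate v_pi(s) by averaging first-visit returns.

    `env` and `policy` are hypothetical stand-ins: `env.reset()` returns
    a state, `env.step(action)` returns (next_state, reward, done), and
    `policy(state)` returns an action.
    """
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)             # current value estimates

    for _ in range(num_episodes):
        # Generate one full episode following the policy.
        episode = []  # list of (state, reward) pairs
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Index of the first occurrence of each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            if s not in first_visit:
                first_visit[s] = t

        # Walk backwards, accumulating the discounted return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            # Only record G at the *first* visit to s in this episode.
            if first_visit[s] == t:
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]

    return V
```

Computing $G$ in a single backward pass uses the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, which avoids re-summing the tail of the episode for every visited state.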
> Does this mean averaging all rewards following that state $s_1$ to the end of the episode and then assigning the resulting value to $s_1$? Or does it mean the immediate reward received for taking an action in $s_1$, averaged over multiple episodes?
So your first thought is correct: the value assigned to $s_1$ is the discounted sum of all rewards from its first visit through the end of the episode, averaged over episodes, not just the immediate reward.
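For a concrete (made-up) illustration: suppose an episode first visits $s_1$ and then collects rewards $1, 0, 2$ before terminating, with $\gamma = 0.9$. The first-visit return recorded for $s_1$ is $G = 1 + 0.9 \cdot 0 + 0.9^2 \cdot 2 = 2.62$, and $v_\pi(s_1)$ is estimated as the average of such returns over many episodes.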
