The definition of an MDP says that the reward r(s,a,s') is the expected reward for taking action a in state s and reaching state s'. This means a given (s,a,s') can have a constant reward, or some distribution of rewards, as long as it has a well-defined expectation. As you've defined it, the reward is proportional to the number of obstacles passed. Because the game can continue forever, the reward for some (s,a,s') begins to look like the sum of the natural numbers. This series diverges, so it does not have an expectation. In practice, if you ran Q-learning you would probably see the value function diverge (NaN values), but the policy partway through learning might still be okay, because the values that grow the fastest will be those of the best state-action pairs.
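To see why the values blow up, here is a minimal sketch of tabular Q-learning (the environment interface and variable names are hypothetical, just for illustration): if the reward returned by each step grows with the running score, the Q estimates keep growing without bound.

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = defaultdict(float)  # keyed by (state, action)

def q_learning_episode(env, actions):
    """Run one episode of tabular Q-learning on a hypothetical env
    with reset() and step(action) -> (next_state, reward, done)."""
    state = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = env.step(action)
        # one-step Q-learning update; if `reward` scales with the score
        # so far, the targets (and hence Q) grow without bound
        best_next = max(Q[(next_state, a)] for a in actions)
        td_target = reward + gamma * best_next * (not done)
        Q[(state, action)] += alpha * (td_target - Q[(state, action)])
        state = next_state
```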
To avoid this, you should choose a different reward function. You could reward the agent with whatever its score is when it dies (a big reward at the end, zero otherwise). You would also be fine giving a living reward (a small reward each time step), as long as the agent has no choice but to move forward. As long as the highest total rewards are assigned to the longest runs (and the expectation of the reward for a (s,a,s') tuple is well defined), it's good.
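As a rough illustration of those two options (the function names and arguments here are made up for the sketch), the per-transition reward could be computed like this:

```python
def terminal_score_reward(score, died):
    # Option 1: zero every step, pay out the final score only when the agent dies.
    return float(score) if died else 0.0

def living_reward(died, step_bonus=1.0):
    # Option 2: a small constant reward for every time step survived,
    # so longer runs accumulate more total reward.
    return 0.0 if died else step_bonus
```

Either way, the reward for any single (s,a,s') transition is bounded, so its expectation is well defined, and the longest runs still collect the highest total reward.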