
I want to implement Q-Learning for the Chrome dinosaur game (the one you can play when you are offline).

I defined my state as: distance to the next obstacle, speed, and the size of the next obstacle.
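
For concreteness, a rough sketch of how that state could be encoded as a tuple (the bucket sizes here are arbitrary, just to keep the table small; nothing is fixed yet):

    def encode_state(distance_to_obstacle, speed, obstacle_size):
        # Discretize the continuous values into coarse buckets so the Q-table stays small.
        distance_bucket = min(int(distance_to_obstacle // 50), 10)
        speed_bucket = min(int(speed // 2), 10)
        return (distance_bucket, speed_bucket, obstacle_size)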

For the reward I wanted to use the number of successfully passed obstacles, but then the same state can have different immediate rewards: the same type of obstacle could reappear later in the game, and the reward for passing it would be higher because more obstacles have already been passed.

My question now is: is this a problem, or would Q-Learning still work? If not, is there a better way?

7Orion7
  • I'd suggest a reward scheme with a large negative reward for dying, and a positive reward every time the score increments (probably equal to the score increment). I don't see an issue for Q-learning; I've seen amazing Pac-Man agents using vanilla Q-learning, so the dinosaur game shouldn't be an issue. – Patrick Coady Apr 16 '17 at 01:46

1 Answer

The definition of an MDP says that the reward r(s, a, s') is the expected reward for taking action a in state s and ending up in state s'. This means a given (s, a, s') can have a constant reward, or some distribution of rewards, as long as that distribution has a well-defined expectation. As you've defined it, the reward grows with the number of obstacles already passed. Because the game can continue indefinitely, the reward for some (s, a, s') starts to look like the sum of the natural numbers. That series diverges, so the expectation does not exist. In practice, if you ran Q-learning you would probably see the value function blow up (eventually producing NaN values), but the policy partway through learning might still be okay, because the values that grow fastest belong to the best state-action pairs.
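
For reference, in standard MDP notation that expectation is

$$r(s, a, s') = \mathbb{E}\left[R_{t+1} \mid S_t = s,\, A_t = a,\, S_{t+1} = s'\right],$$

and it has to be finite for the definition above to make sense.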

To avoid this, choose a different reward function. You could reward the agent with its final score when it dies (a big reward at the end, zero otherwise). You would also be fine giving a living reward (a small reward at each time step) as long as the agent has no choice but to move forward. As long as the highest total reward goes to the longest runs (and the reward for each (s, a, s') tuple has a well-defined expectation), you're good.
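
As a rough sketch of what such a scheme might look like with tabular Q-learning (the hyperparameters, and names like `reward_fn` and `passed_obstacle`, are illustrative placeholders rather than anything taken from the game):

    import random
    from collections import defaultdict

    ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1   # learning rate, discount, exploration rate
    ACTIONS = [0, 1]                          # e.g. 0 = do nothing, 1 = jump

    Q = defaultdict(float)                    # Q[(state, action)] -> estimated value

    def reward_fn(passed_obstacle, died):
        # Bounded per-transition reward: fixed bonus per obstacle passed,
        # a small living reward otherwise, and a large penalty for dying.
        if died:
            return -100.0
        return 1.0 if passed_obstacle else 0.01

    def choose_action(state):
        # Epsilon-greedy over the action set.
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])

    def q_update(state, action, reward, next_state, done):
        # Standard Q-learning backup; no bootstrap from terminal states.
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

Because every transition's reward comes from a bounded range, its expectation is trivially well defined, and longer runs still collect more total reward than shorter ones.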

Nick Walker