
I'm looking at the FrozenLake environments in OpenAI Gym. In both of them, there are no rewards, not even negative ones, until the agent reaches the goal. Even if the agent falls through the ice, there is no negative reward -- although the episode ends. Without rewards, there is nothing to learn! Each episode starts from scratch with no benefit from previous episodes.

This should be a simple breadth-first search problem; it doesn't need RL. But assuming you do use RL, one approach would be a reward of -1 for a step onto a frozen square (that isn't the goal) and a reward of -10 for a step into a hole. The -1 would let the agent learn not to revisit squares, and the -10 would let it learn to avoid the holes. So I'm tempted to create my own negative rewards on the agent side. This would make it more like the CliffWalking environment.
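A minimal sketch of that agent-side shaping, assuming the classic Gym step API (obs, reward, done, info) and that FrozenLake only returns a terminal reward of 1 at the goal; the wrapper name is illustrative, not part of Gym:

import gym

class ShapedFrozenLake(gym.Wrapper):
    """Re-map FrozenLake's sparse rewards to the -1 / -10 scheme above."""
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if done and reward == 0:
            # Episode ended without the goal reward of 1, so the agent most
            # likely fell into a hole. A time-limit truncation would look the
            # same under this classic API, so this is a rough heuristic.
            reward = -10.0
        elif not done:
            # Ordinary step onto a frozen square that isn't the goal.
            reward = -1.0
        return obs, reward, done, info

env = ShapedFrozenLake(gym.make("FrozenLake-v0"))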

What am I missing? How would RL solve this (except via random search) with no rewards?

RussAbbott

2 Answers


The problem you are describing is often addressed with reward shaping.

Like the FrozenLake environment, some problems (Montezuma's Revenge is another classic example) have very sparse rewards. This means that an RL agent must spend a long time exploring the environment before it ever sees a reward, which can be very frustrating for the humans who designed the task. So, as in the FrozenLake environment, people often add extra information along the lines you have suggested. This makes the reward function denser and (sometimes) allows for faster learning, provided the modified reward function still reflects what the human actually wants the agent to do.

In order for the agent to solve these kinds of problems faster than random search and without human intervention, such as reward shaping or giving the agent a video of an expert playing the game, it needs some mechanism to explore the space in an intelligent way [citation needed].

Some current research areas on this topic are Intrinsic Motivation, Curiosity, and Options and Option discovery.

Although promising, these research areas are still in their infancy, and sometimes it's just easier to say:

if agent_is_in_a_hole:
  return -10
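If you do take that shortcut, here is one way the agent_is_in_a_hole check could be wired up on the agent side (a sketch, again assuming the classic 4-tuple Gym step API; the hole states are read from env.unwrapped.desc, and the random action choice is just a stand-in for whatever policy the agent uses):

import gym

env = gym.make("FrozenLake-v0")
desc = env.unwrapped.desc.flatten()        # map tiles: b'S', b'F', b'H', b'G'
holes = {s for s, tile in enumerate(desc) if tile == b'H'}

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()     # placeholder policy
    obs, reward, done, info = env.step(action)
    if obs in holes:                       # agent_is_in_a_hole
        reward = -10.0
    # ...feed (obs, shaped reward) into the learning update here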
Jaden Travnik

I think the objective of this environment is to discover ways to balance exploration vs. exploitation, so the reward manipulation is neither required nor desirable. Now, if you try to run Q-learning on the 8x8 environment, you may find that it does not converge. A fix for this was given by JKCooper on the OpenAI forum; scroll all the way to the bottom of this page to see the comment: https://gym.openai.com/evaluations/eval_xSOlwrBsQDqUW7y6lJOevQ

There, he introduces the concept of an average terminal reward, which is then used to calibrate the exploration rate. At the beginning, the average terminal reward is undefined (null). On the very first episode that finishes ("done"), the variable is set to that episode's terminal reward. On each subsequent episode, if the current terminal reward is greater than the existing average, the epsilon value is decayed, i.e. exploration is gradually discouraged and exploitation encouraged.
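A rough sketch of how I read that scheme, applied to tabular Q-learning on the 8x8 map (the hyperparameters, the decay factor, and the running-average update are my own assumptions, not JKCooper's exact code):

import gym
import numpy as np

env = gym.make("FrozenLake8x8-v0")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma = 0.1, 0.99
epsilon, epsilon_decay, epsilon_min = 1.0, 0.99, 0.01
avg_terminal_reward = None                    # undefined until an episode finishes

for episode in range(20000):
    state = env.reset()
    done = False
    while not done:
        if np.random.rand() < epsilon:
            action = env.action_space.sample()            # explore
        else:
            action = int(np.argmax(Q[state]))             # exploit
        next_state, reward, done, _ = env.step(action)
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

    # `reward` now holds the episode's terminal reward (1 at the goal, else 0).
    if avg_terminal_reward is None:
        avg_terminal_reward = reward                      # first "done" episode
    else:
        if reward > avg_terminal_reward:
            # Doing better than average: shift gradually toward exploitation.
            epsilon = max(epsilon * epsilon_decay, epsilon_min)
        avg_terminal_reward += 0.05 * (reward - avg_terminal_reward)   # running average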

Using this technique, you can see that Q-learning converges.

The modified version on OpenAI is here (v0.0.2):

https://gym.openai.com/evaluations/eval_FVrk7LAVS3zNHzzvissRQ/

ameet chaubal