
In Deep Reinforcement Learning with continuous action spaces, why does it seem to be common practice to clamp the action right before the agent executes it?

Examples:

OpenAI Gym Mountain Car https://github.com/openai/gym/blob/master/gym/envs/classic_control/continuous_mountain_car.py#L57

Unity 3DBall https://github.com/Unity-Technologies/ml-agents/blob/master/unity-environment/Assets/ML-Agents/Examples/3DBall/Scripts/Ball3DAgent.cs#L29

Isn't information lost by doing so? For example, if the model outputs +10 for velocity and that value is clamped to +1, the executed action effectively behaves discretely (it saturates at the bound). For fine-grained movement, wouldn't it make more sense to multiply the output by something like 0.1 instead?
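
To make the distinction concrete, here is a minimal sketch (assuming an action range of [-1, 1]; the numbers are purely illustrative) of clamping versus scaling a raw policy output:

```python
import numpy as np

raw_output = np.array([10.0])  # hypothetical unbounded output of the policy network

# Clamping: everything at or beyond the bound saturates to the same value
clamped = np.clip(raw_output, -1.0, 1.0)   # 10.0 -> 1.0 (100.0 would also map to 1.0)

# Scaling: the relative magnitude of the output is preserved
scaled = 0.1 * raw_output                  # 10.0 -> 1.0, but 5.0 would map to 0.5
```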

MarcoMeter

1 Answer


This is probably done simply to enforce constraints on what the agent can do. Maybe the agent would like to output an action that increases velocity by 1,000,000. But if the agent is a self-driving car with a weak engine that can accelerate by at most 1 unit, we don't care that the agent would hypothetically like to accelerate by more units: the car's engine has limited capabilities.
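
In gym-style environments this amounts to clipping the raw action to the bounds declared by the action space before the dynamics ever see it. Here is a minimal sketch of that pattern, assuming the classic gym API (where `env.step` returns four values); it is not the exact code from either linked example:

```python
import numpy as np
import gym

env = gym.make("MountainCarContinuous-v0")
env.reset()

raw_action = np.array([1_000_000.0])  # whatever the policy happens to output

# Only the physically feasible part of the action reaches the dynamics
feasible_action = np.clip(raw_action,
                          env.action_space.low,
                          env.action_space.high)
obs, reward, done, info = env.step(feasible_action)
```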

Dennis Soemers
  • It makes sense for enforcing constraints, which can be done regardless. It's just that the acceleration most likely won't take on values within the range from 0 to 1 in situations where accelerating is estimated to be rewarding. – MarcoMeter Jan 01 '18 at 11:25
  • @MarcoMeter I'm not quite sure what you mean with that comment, can you elaborate? – Dennis Soemers Jan 01 '18 at 11:42
  • Let's assume that the neural net output is 100 and that, before the action's execution, the value gets clamped to 1. Due to that action the agent received a reward, so the model gets positive feedback on selecting 100 as a suitable action and will more likely output values at that scale. – MarcoMeter Jan 01 '18 at 14:12
  • @MarcoMeter Sure, but that's not really a problem. Actions greater than 1 will consistently be interpreted as being equal to 1 by the environment, so it's fine to reward them all as if the agent had chosen just 1. Also note that the environment's action space (in the case of gym) is defined as [-1.0, 1.0] too, so really agents shouldn't generate any output outside of that range anyway. I guess an alternative would be to "disqualify" an agent for illegal actions or something like that, but this works just fine too and is a bit more forgiving. – Dennis Soemers Jan 01 '18 at 17:00
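
To illustrate the point made in these comments: once clipping is applied, every output at or beyond the bound maps to the same executed action and therefore receives exactly the same feedback from the environment. A minimal sketch, with bounds assumed to be [-1.0, 1.0] as in the gym example:

```python
import numpy as np

low, high = -1.0, 1.0  # bounds declared by the environment's action space

for raw in (1.0, 100.0):
    executed = np.clip(raw, low, high)
    print(f"policy output {raw} is executed as {executed}")

# Both outputs are executed as 1.0, so the environment rewards them identically.
```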