
I have a 500×500 grid with 7 different penalty values. I need to build an RL agent whose action space contains 11 actions (Left, Right, Up, Down, the 4 diagonal directions, Speed Up, Speed Down, and Normal Speed). How can I solve this problem? The chosen action is performed with probability 0.8; otherwise a random action is executed. Also, the penalty values can change dynamically.

  • What do you mean by the penalty values changing dynamically? Is it something where state 1 could return a penalty drawn from some distribution with a mean of x, or is it completely uniform? Are the dynamic penalty values just handling reward shaping for you? – Derek_M May 09 '17 at 16:47
  • By dynamic change, I mean that at one instance reaching state 1 gives a penalty of 4, while at another instance reaching state 1 may give a penalty of 5. You can take it as state 1 giving a penalty drawn from a normal distribution. This is true for every state. –  May 11 '17 at 04:38

1 Answer


Take a look at this chapter by Sutton, incompleteideas.net/sutton/book/ebook/node15.html, especially the experiments in the later sections. Your problem resembles the N-armed bandit in that each arm returns rewards drawn from a normal distribution. While the chapter mostly focuses on exploration, the same ideas apply here.
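
For illustration, the kind of incremental sample-average estimate that chapter describes looks roughly like the sketch below. This is a minimal example, not your exact setup; `pull` stands in for whatever returns a noisy reward (or negative penalty) for an arm:

    import random

    # Sketch of an epsilon-greedy N-armed bandit with incremental
    # sample-average updates. `pull(arm)` is a hypothetical function that
    # returns a noisy reward for the chosen arm.
    def run_bandit(pull, n_arms, steps=10000, epsilon=0.1):
        q = [0.0] * n_arms   # running estimate of each arm's mean reward
        n = [0] * n_arms     # number of times each arm was pulled
        for _ in range(steps):
            if random.random() < epsilon:
                a = random.randrange(n_arms)                # explore
            else:
                a = max(range(n_arms), key=lambda i: q[i])  # exploit
            r = pull(a)
            n[a] += 1
            q[a] += (r - q[a]) / n[a]   # incremental sample-average update
        return q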

Another way to look at it: if your states really return normally distributed penalties, you will need to explore the domain enough to estimate the mean return of each (state, action) pair. That estimated mean is Q*, which gives you the optimal policy.
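
As an illustration, a tabular Q-learning loop along these lines might look like the sketch below. The `env` object, its `reset()`/`step()` interface, and the hyperparameter values are assumptions made for the example, not something from your description:

    import random
    from collections import defaultdict

    # Sketch of tabular Q-learning with epsilon-greedy exploration.
    # step() is expected to apply the 0.8/0.2 action noise and return the
    # (possibly randomly drawn) penalty as a negative reward.
    def q_learning(env, n_actions=11, episodes=5000,
                   alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = defaultdict(lambda: [0.0] * n_actions)
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                if random.random() < epsilon:
                    action = random.randrange(n_actions)
                else:
                    action = max(range(n_actions), key=lambda a: Q[state][a])
                next_state, reward, done = env.step(action)
                # Q-learning target: reward plus discounted best next value.
                target = reward + (0.0 if done else gamma * max(Q[next_state]))
                Q[state][action] += alpha * (target - Q[state][action])
                state = next_state
        return Q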

As a follow-up, if the state space is too large or continuous, it may be worth looking into generalization with a function approximator. While many of the same convergence results apply, there are cases where function approximation runs into issues. A full treatment of that is beyond the scope of this discussion.
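
If you do go that route, one common option is semi-gradient Q-learning with a linear approximator. A rough sketch of the update, with a hypothetical `features` function (e.g. from tile coding), is:

    import numpy as np

    # Rough shape of a semi-gradient Q-learning update with a linear
    # approximator. `features(state, action)` is a hypothetical function
    # mapping a (state, action) pair to a fixed-length feature vector.
    def linear_q_update(w, features, state, action, reward, next_state,
                        n_actions, alpha=0.01, gamma=0.99):
        q_sa = np.dot(w, features(state, action))
        q_next = max(np.dot(w, features(next_state, a)) for a in range(n_actions))
        td_error = reward + gamma * q_next - q_sa
        # For a linear approximator, the gradient of q_sa w.r.t. w is just
        # the feature vector itself.
        return w + alpha * td_error * features(state, action)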

Derek_M