
I've built a custom reinforcement learning environment and agent for something similar to a labyrinth game.

In the labyrinth there are 5 possible actions: up, down, left, right, and stay. When a move is blocked, e.g. the agent can't go up, how do people design the environment and the agent to handle that?

To be specific, suppose the agent is at state s0, where taking down, left, or right transitions to some other state with an immediate reward (positive if the new state is the exit). One possible approach: when the agent takes the action up, the state stays at s0 and the reward is a large negative number. Ideally the agent learns this and never goes up again from this state.
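For concreteness, here is a minimal sketch of this penalty approach (the maze layout, `WALL_PENALTY`, and the action encoding are placeholders, not my actual environment):

```python
import numpy as np

WALL_PENALTY = -10.0   # large negative reward for bumping into a wall
EXIT_REWARD = 1.0
STEP_REWARD = 0.0

# 0 = free cell, 1 = wall; a tiny illustrative maze
GRID = np.array([
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
])
EXIT_POS = (2, 2)

# up, down, left, right, stay
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1), 4: (0, 0)}

def step(state, action):
    """Apply an action; blocked moves keep the state and return a penalty."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    blocked = not (0 <= r < GRID.shape[0] and 0 <= c < GRID.shape[1]) or GRID[r, c] == 1
    if blocked:
        return state, WALL_PENALTY, False   # stay at the same state, punish the invalid move
    new_state = (r, c)
    if new_state == EXIT_POS:
        return new_state, EXIT_REWARD, True  # positive reward only at the exit
    return new_state, STEP_REWARD, False

print(step((0, 0), 1))  # -> ((1, 0), 0.0, False)
print(step((0, 0), 0))  # up is blocked at the border -> ((0, 0), -10.0, False)
```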

However, my agent does not seem to learn this; it still goes up. Another approach is to hard-code the agent and the environment so that the agent cannot take the action up when at s0. What I can think of is:

  1. when at a state where up is not allowed, look at the Q values of the different actions
  2. pick the action with the largest Q value, excluding up
  3. therefore the agent will never perform an invalid action (a sketch of this masking is shown below)
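
Here is a sketch of that masking idea (the `valid_actions` argument is a hypothetical helper; in my real environment it would come from the maze layout):

```python
import numpy as np

def select_action(q_values, valid_actions):
    """Greedy action selection restricted to the valid actions for this state.

    q_values: 1-D float array of Q(s, a) for every action.
    valid_actions: indices of the actions that are legal in state s.
    """
    masked = np.full(q_values.shape, -np.inf)
    masked[valid_actions] = q_values[valid_actions]
    return int(np.argmax(masked))

# Example: 'up' (index 0) is blocked at s0, so only the other actions compete
q = np.array([5.0, 1.2, -0.3, 0.8, 0.1])   # Q values for up, down, left, right, stay
print(select_action(q, valid_actions=[1, 2, 3, 4]))  # -> 1 (down), even though up has the largest Q
```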

Is the above approach feasible? Will there be any issues with it? Or is there a better design for dealing with boundaries and invalid actions?

Kevin Fang

2 Answers


I have seen this problem many times, where an agent gets stuck on a single action. I have seen it in the following cases:

  1. The input images were not normalized, so the gradients became huge and the whole network saturated to a single action.
  2. I was not using an entropy bonus to increase the randomness in the initial search. Please find more details about this work here (see the short sketch below).
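
As a rough illustration of both points (just a sketch with assumed shapes and a hypothetical `entropy_beta`, not code from the linked work):

```python
import torch

def preprocess(frame):
    # Point 1: scale raw image pixels into [0, 1] so gradients stay well behaved
    return torch.as_tensor(frame, dtype=torch.float32) / 255.0

def pg_loss_with_entropy(logits, actions, advantages, entropy_beta=0.01):
    # Point 2: subtract an entropy bonus so the policy keeps exploring early on
    dist = torch.distributions.Categorical(logits=logits)
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    entropy_bonus = dist.entropy().mean()
    return policy_loss - entropy_beta * entropy_bonus
```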

I hope this helps.

Abhishek Mishra
  • Hi Abhishek, thanks for your answer, but this is not what I mean. The model does not get stuck on one action; it tries to perform 'physically impossible' actions, e.g. in Montezuma's Revenge you cannot go 'down' when you're on the ground, but you can go 'down' on ladders – Kevin Fang Jul 09 '18 at 03:26

I would say this should work (though trying it beats guessing). Other questions would be: what state is your agent able to observe? Are you doing reward clipping?
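By reward clipping I mean something like this (only a sketch; the [-1, 1] range is a common default, not something specific to your setup):

```python
import numpy as np

def clip_reward(r):
    # Keep rewards in [-1, 1] so a single large penalty cannot dominate the updates
    return float(np.clip(r, -1.0, 1.0))
```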

On the other hand, if your agent did not learn to avoid running into walls, there might be another problem within your learning routine (maybe there is a bug in the reward function?).

Hard-coded action clipping might produce the behavior you want to see, but it certainly cuts down the overall performance of your agent.

What else did you implement? If not done yet, it might be good to take experience replay into account.
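
If you haven't, a minimal replay buffer looks roughly like this (a sketch with an assumed transition format, not tied to your environment):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # old transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample a decorrelated batch of past transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```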

mrk
  • The state I observe is a series of continuous values, and the model is expected to learn the sign of the difference of values. Yeah, I already added experience replay, and I currently hard-code 'illegal' actions, though I still want it to 'automatically' know which actions are 'illegal' after sufficient training. – Kevin Fang Jul 09 '18 at 03:28