I've built a custom reinforcement learning environment and agent for something similar to a labyrinth game. In the labyrinth there are 5 possible actions: up, down, left, right, and stay. When a move is blocked, e.g. the agent can't go up because of a wall, how do people design the environment and the agent to handle that?
To be specific, suppose the agent is at the current state `s0`, where by the game's definition taking the actions down, left, and right moves it to some other state with an immediate reward (positive if it reaches the exit). One possible approach: when the agent takes the action `up`, the state stays at `s0` and the reward is a large negative number. Ideally the agent would learn this and never go `up` again at that state.
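For concreteness, here is a minimal sketch of that penalty approach inside the environment's step function; the grid encoding, action numbering, and reward values are my own illustrative assumptions, not taken from your setup:

```python
import numpy as np

# Hypothetical action encoding and movement deltas for illustration.
UP, DOWN, LEFT, RIGHT, STAY = range(5)
DELTAS = {UP: (-1, 0), DOWN: (1, 0), LEFT: (0, -1), RIGHT: (0, 1), STAY: (0, 0)}

class Labyrinth:
    def __init__(self, grid, exit_pos, start_pos=(0, 0)):
        self.grid = np.asarray(grid)  # 0 = free cell, 1 = wall
        self.exit_pos = exit_pos
        self.pos = start_pos

    def step(self, action):
        dr, dc = DELTAS[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        in_bounds = 0 <= r < self.grid.shape[0] and 0 <= c < self.grid.shape[1]
        if not in_bounds or self.grid[r, c] == 1:
            # Blocked move: state is unchanged, reward is a large penalty.
            return self.pos, -10.0, False
        self.pos = (r, c)
        if self.pos == self.exit_pos:
            return self.pos, 1.0, True   # reached the exit
        return self.pos, -0.01, False    # small per-step cost elsewhere
```

One thing worth checking in your setup: even after the Q value for `up` at `s0` has been driven down, an ε-greedy agent will still occasionally pick it during exploration, which can look like the agent "hasn't learned".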
However, my agent doesn't seem to learn this; it still goes `up`. Another approach is to hard-code the environment and the agent so that the agent simply cannot perform the action `up` when at `s0`. What I can think of is:
- when at a state where `up` is not allowed, look at the Q values of the different actions;
- pick the action with the largest Q value, excluding `up` (see the sketch after this list);
- this way, the agent never performs an invalid action.
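A minimal sketch of that masking idea for a tabular ε-greedy agent; `valid_actions` is a hypothetical helper you would implement from your wall layout, and the Q-table indexing is an assumption:

```python
import numpy as np

def select_action(Q, s, valid_actions, epsilon=0.1, rng=np.random.default_rng()):
    """Epsilon-greedy selection restricted to the valid actions in state s.

    Q             -- dict or 2-D array mapping (state, action) to a value
    valid_actions -- callable returning the list of legal actions in s
    """
    valid = valid_actions(s)           # e.g. [DOWN, LEFT, RIGHT, STAY] at s0
    if rng.random() < epsilon:
        return int(rng.choice(valid))  # explore, but only among valid actions
    # Greedy step: argmax over the valid actions only, never over `up`.
    return max(valid, key=lambda a: Q[s, a])
```

If you mask at selection time, it is usually worth masking in the learning update as well, e.g. taking the max over the valid actions of the next state when computing the TD target, so invalid actions never leak into the bootstrapped values. This "action masking" design is common and generally preferred over large penalties, since the agent wastes no samples learning what is structurally impossible.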
Is the above approach feasible? Are there any issues with it? Or is there a better design for dealing with boundaries and invalid actions?