I have implemented a custom openai gym environment for a game similar to http://curvefever.io/, but with discreet actions instead of continuous. So my agent can in each step go in one of four directions, left/up/right/down. However one of these actions will always lead to the agent crashing into itself, since it cant "reverse".
Currently I just let the agent take any move, and just let it die if it makes an invalid move, hoping that it will eventually learn to not take that action in that state. I have however read that one can set the probabilities for making an illegal move zero, and then sample an action. Is there any other way to tackle this problem?