I have tried to build a simple turn-based snake game in Python, played by a TensorFlow model: the agent moves on a board (e.g. 40x40 cells), leaving a trail at each visited cell. In each round the agent has to choose one of three possible actions (action space: turn left, turn right, do nothing) and then moves in its current direction. The goal of the agent is to survive as long as possible and not to collide with its own trail, the board wall, or the trail of another player. Every time the agent dies it receives a large negative reward, from which it should learn not to repeat that move in the future.
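For concreteness, here is a minimal sketch of the per-round mechanics as I described them (the direction encoding, the reward value and the function names are illustrative, not my exact code):

import numpy as np

# Directions ordered clockwise: up, right, down, left
DIRECTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]
DEATH_REWARD = -100  # illustrative magnitude

def step(board, pos, dir_idx, action):
    """Apply one action (0 = turn left, 1 = do nothing, 2 = turn right)."""
    dir_idx = (dir_idx + action - 1) % 4  # rotate the current direction
    dy, dx = DIRECTIONS[dir_idx]
    y, x = pos[0] + dy, pos[1] + dx  # move one cell in that direction
    hit_wall = not (0 <= y < board.shape[0] and 0 <= x < board.shape[1])
    if hit_wall or board[y, x] != 0:  # wall or any trail means death
        return pos, dir_idx, DEATH_REWARD, True
    board[y, x] = 1  # leave a trail on the visited cell
    return (y, x), dir_idx, 0, False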
With prolonged training I see significant learning progress (growing survival time), but I have also made an observation I do not understand:
In some cases the model makes obviously wrong decisions, i.e. there are several safe options available, yet it chooses the action that instantly leads to death. Even worse, the (softmaxed) Q-value of this deadly action is 1.0 (100%)!
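To illustrate what I mean by 1.0: even a moderate gap between the raw values is enough for the softmax to saturate (the numbers here are made up):

import numpy as np

def softmax(q):
    e = np.exp(q - np.max(q))
    return e / e.sum()

print(softmax(np.array([9.0, 1.5, -2.0])))
# ~[0.9994, 0.0006, 0.0000]; the top action rounds to 1.0 (100%)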
At the moment the model looks like this:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(256, input_shape=(125,), activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(3, activation='softmax'))  # one output per action
model.compile(loss='mse', optimizer=Adam(learning_rate=0.0005))
The input is an 11 x 11 section of the board centered on the agent (121 values) plus the agent's direction as a one-hot vector (4 values), 125 values in total.
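To make the encoding concrete, here is a sketch of how such a state vector can be assembled (padding the border as occupied cells and the cell encoding are illustrative choices):

import numpy as np

def make_state(board, pos, dir_idx, view=5):
    """Crop an 11x11 window around the agent and append a one-hot direction."""
    padded = np.pad(board, view, constant_values=1)  # treat the wall as occupied
    y, x = pos[0] + view, pos[1] + view  # agent position in the padded board
    window = padded[y - view:y + view + 1, x - view:x + view + 1]
    direction = np.eye(4)[dir_idx]  # one-hot direction, 4 values
    return np.concatenate([window.ravel(), direction])  # 121 + 4 = 125 values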
Of course I have tried some model variations (layer sizes, number of hidden dense layers), but without success so far.
My general question is: what are possible reasons for such wrong learning behaviour?