I'm trying to train a DRL agent to play a game using the DQN method. The game is pretty straightforward and similar to Breakout: fruits fall vertically from the top of the screen, and the agent just needs to align itself with the falling fruit to collect the reward. There are three actions it can take: move left, stay, or move right.
Let's say that a2 refers to not moving the paddle, a3 refers to moving right and a1 refers to moving left.
Let's say we take the sub-optimal action a3 (move right) and transition to the next state. The best thing to do in that state is to move left (a1), undoing the detour, and then continue with the optimal policy. So the only cost difference between a2 and a3 is the two steps wasted going right and coming back.
If there is no negative reward for taking the sub-optimal action, then the agent has no incentive to choose the optimal action. So the negative reward for taking the sub-optimal action should be high enough that the agent is discouraged from doing it. I've tried to put this intuition mathematically below, and I think it could explain why the Q-values end up so close to each other.
Then, the optimal Q* function should roughly satisfy the following:
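(This is only a sketch of the intuition, under the assumption that the detour costs nothing except the discounting over the two wasted steps and that the fruit can still be caught afterwards.)

$$Q^*(s, a_3) \;\approx\; \gamma^2 \, Q^*(s, a_2)$$

so the gap between the optimal and the sub-optimal action is

$$Q^*(s, a_2) - Q^*(s, a_3) \;\approx\; \left(1 - \gamma^2\right) Q^*(s, a_2)$$

which is tiny for the usual discount factors: with $\gamma = 0.99$, $1 - \gamma^2 \approx 0.02$, i.e. the two Q-values differ by only about 2%.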
1) Is this correct? (Is there a flaw in this argument?)
2) Could this explain why the Q-values are very close to each other in deep Q-learning?
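For concreteness, the kind of negative reward I have in mind for the wasted steps would be a small movement penalty, something like the sketch below (this assumes a Gymnasium-style environment and an action encoding of 0 = left, 1 = stay, 2 = right; the wrapper name and the penalty value are purely illustrative, not part of my actual setup):

```python
import gymnasium as gym

STAY = 1  # illustrative action encoding: 0 = move left, 1 = stay, 2 = move right


class MovePenaltyWrapper(gym.Wrapper):
    """Subtract a small constant from the reward whenever the paddle moves,
    so the two wasted steps of a detour (go right, then come back left)
    cost about 2 * penalty compared to staying put."""

    def __init__(self, env, penalty=0.05):
        super().__init__(env)
        self.penalty = penalty

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if action != STAY:
            reward -= self.penalty  # moving costs a little; staying is free
        return obs, reward, terminated, truncated, info
```

With something like this, the Q-values of a2 and a3 in the argument above would differ by roughly 2 * penalty rather than by the small discounting term alone.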