
I understand the whole gist of Q-learning and its update equation:

$$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$$

where $s$ is the current state, $a$ is the action taken, $r$ is the reward, $s'$ is the next state as a result of the action, and we maximize across the actions in that state. This is all well and good, and I was able to implement Q-learning using a table.
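
For concreteness, a minimal tabular sketch (assuming NumPy; the learning rate alpha below is the usual extra factor that the formula above omits, i.e. the formula is the alpha = 1 case):

```python
# Minimal tabular Q-learning sketch (NumPy assumed; sizes are illustrative).
import numpy as np

n_states, n_actions = 16, 3
Q = np.zeros((n_states, n_actions))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # TD target: r + gamma * max_a' Q(s', a')
    target = r + gamma * np.max(Q[s_next])
    # The formula above corresponds to the alpha = 1 special case of this update.
    Q[s, a] += alpha * (target - Q[s, a])
```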

My struggles come when translating this to training a deep Q-network. So far, I understand that we need two networks (each outputting the Q-values for all actions of a given state): one that is actually being trained, and one used to predict the Q-values of the next state. The prediction network's weights periodically get updated with the main network's weights. My question is about what the target outputs should be during training.
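
Roughly the setup I mean, as a sketch (assuming PyTorch; the class and variable names are just illustrative):

```python
import copy
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

online_net = QNetwork(state_dim=8, n_actions=3)   # the network being trained
target_net = copy.deepcopy(online_net)            # frozen copy for next-state Q-values

# ... training loop ...
# every N steps, copy the online weights into the prediction/target network:
target_net.load_state_dict(online_net.state_dict())
```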

For example, in traditional digit classification, the model is trained with a one-hot encoded target over 10 outputs. But in, say, a game of Snake with 3 actions (forward, left, right), each experience-replay sample only gives us a value for the one action that was taken. We don't know what the values for the other two actions are. What should the targets for the other two actions be?

I've thought about making them 0, but this doesn't make sense, as they aren't actually 0. I've also thought about making them the Q-values output by the prediction network. But again, this still doesn't make sense, as they aren't the actual Q-values. Must I manually back-propagate just from this one output for each sample (which warrants another question, as I have no idea how to do that)?

EDIT: A thought just occurred to me. Maybe we can use the Q-values output by the target network as the targets for the missing Q-values? In that case, the gradients corresponding to these outputs would be zero since there is no difference, and so the corresponding weights won't get updated. Is this a proper hack for this problem, or will it introduce unintended biases into training?
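
To make the EDIT concrete, a sketch of what I have in mind (assuming PyTorch; names are illustrative):

```python
import torch

def build_targets(online_net, target_net, states, actions, rewards,
                  next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Non-taken actions: copy the target network's outputs, per the idea above.
        # (If the online network were used here instead, the error for these
        # entries would be exactly zero; with the lagged target network it is
        # merely close to zero.)
        targets = target_net(states).clone()
        # Taken action: overwrite with the usual TD target.
        max_next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * max_next_q * (1.0 - dones)
    targets[torch.arange(len(actions)), actions] = td_target
    return targets  # regress online_net(states) onto these with e.g. an MSE loss
```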

Rangumi

1 Answer


To answer your third paragraph: while training the Q-network we act step by step, choosing either the action with the maximum Q-value or a random action (epsilon-greedy); we do NOT try every possible action from a given state, only one specific sequence. So for each transition, only one of forward, left, or right was actually taken, and only that action's output contributes a gradient during training. Although the network outputs Q-values for all possible actions, only the taken action and its corresponding gradient have an impact on training; the gradient for the other actions is taken to be 0. For details and good clarity see this GitHub repository, the solved coding assignment of an Andrew Ng course: https://github.com/azminewasi/Machine-Learning-AndrewNg-DeepLearning.AI/tree/main/3%20Unsupervised%20Learning%2C%20Recommenders%2C%20Reinforcement%20Learning/W3/Lab
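
A minimal sketch of that masked/gather approach (PyTorch is assumed here for brevity; the linked lab walks through the same idea):

```python
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, states, actions, rewards,
             next_states, dones, gamma=0.99):
    # actions: LongTensor of taken-action indices, shape (batch,)
    # Q-values for every action, then pick out only the action that was taken.
    q_all = online_net(states)                                   # (batch, n_actions)
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)   # (batch,)

    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * max_next_q * (1.0 - dones)

    # Only q_taken appears in the loss, so the gradient flowing back through the
    # outputs for the other actions is zero -- no explicit targets are needed for them.
    return F.mse_loss(q_taken, td_target)
```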