How does a neural network know which reward it got from an action?

Question

I am currently working on a deep Q-network, and I am a bit confused about how my Q-network knows which reward I give it.

For example, I have this state-action function with policy and temporal difference:

and then I have my Q-network:

Here I input my state and get 4 different Q-values for the same observation. In theory, how do I reward my Q-network, given that its only input is the state and not the reward?

I hope someone can explain this to me!

If by "how do I reward my Q-net" you mean "which loss do I use to train my Q-net", the answer is: the TD error, which is `r + gamma*Q(s',pi(s')) - Q(s,a)`, where `s'` is the next state and `pi` is your policy. — Simon, Feb 26 '18 at 18:29
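To make the formula in this comment concrete, here is a minimal sketch of the TD error for a single transition. It assumes a greedy policy, so `Q(s', pi(s'))` becomes the maximum over the next state's Q-values. The function names, the `done` flag for terminal states, and all numbers are hypothetical and not taken from the question.

```python
# Hypothetical illustration: names and numbers are assumptions, not from the question.

def td_error(q_s, action, reward, q_next, gamma=0.99, done=False):
    """TD error r + gamma * Q(s', pi(s')) - Q(s, a) for one transition.

    q_s    : Q-values the network outputs for state s (one per action)
    q_next : Q-values the network outputs for the next state s'
    Assumes a greedy policy, so Q(s', pi(s')) = max(q_next).
    """
    # No bootstrapping from a terminal state (the done flag is an assumption).
    bootstrap = 0.0 if done else gamma * max(q_next)
    return reward + bootstrap - q_s[action]


def training_target(q_s, action, reward, q_next, gamma=0.99, done=False):
    """Desired output vector: only the taken action's entry is moved."""
    target = list(q_s)
    target[action] = q_s[action] + td_error(q_s, action, reward, q_next, gamma, done)
    return target


q_s = [0.5, 1.0, -0.2, 0.3]      # network output for s (4 actions)
q_next = [0.0, 2.0, 1.0, 0.5]    # network output for s'
delta = td_error(q_s, action=2, reward=1.0, q_next=q_next, gamma=0.9)
target = training_target(q_s, action=2, reward=1.0, q_next=q_next, gamma=0.9)
```

The reward never appears as a network input. It only shapes the target that the output for the taken action is regressed toward, which is why the other three entries of `target` stay equal to the current predictions.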

score 2 · Accepted Answer · answered Feb 23 '18 at 09:21

You should be familiar with training and inference.

In the training phase, you provide inputs and the desired outputs to the neural network. The exact way in which you encode the desired outputs can vary; one way is to define a reward function. The weight-adjustment procedure is then defined to optimize that reward.

In production, the network is used for inference. You now use it to predict the unknown outcomes, but you don't update the weights. Therefore, you don't have a reward function in this phase.

This makes neural networks a form of supervised learning. If you need unsupervised learning, you generally have a bigger problem, and might need different algorithms. One sort-of exception is when you can automatically evaluate the quality of your predictions in hindsight. An example of this is the branch predictor of CPUs; this can be trained using the actual data from branches taken.
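The training/inference split described above can be sketched with a toy example. This is only an illustration under stated assumptions: a tiny linear Q-function stands in for the network, the update is a simple semi-gradient TD step, and every name and number (`q_values`, `act`, `train_step`, the learning rate) is hypothetical.

```python
# Hypothetical sketch: a linear Q-function, Q(s, a) = sum_i weights[a][i] * s[i].

def q_values(weights, state):
    """Forward pass: one Q-value per action. Needs only the state."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def act(weights, state):
    """Inference: pick the greedy action. No reward and no weight update."""
    q = q_values(weights, state)
    return q.index(max(q))


def train_step(weights, state, action, reward, next_state, gamma=0.9, lr=0.1):
    """Training: the reward enters through the TD target, not the input."""
    target = reward + gamma * max(q_values(weights, next_state))
    error = target - q_values(weights, state)[action]
    # Semi-gradient update of the taken action's weights only.
    weights[action] = [w + lr * error * s for w, s in zip(weights[action], state)]
    return error


weights = [[0.0, 0.0] for _ in range(4)]   # 4 actions, 2 state features
err = train_step(weights, state=[1.0, 0.0], action=1, reward=1.0,
                 next_state=[0.0, 1.0])
best = act(weights, [1.0, 0.0])            # inference after one update
```

Only `train_step` takes a reward argument; `act` sees the state alone, which matches the description of inference given above.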

But what if my deep reinforcement learning algorithm has to keep training while it is in the inference phase? :) — Søren Koch, Feb 24 '18 at 15:20
