I’m wondering about the reward policy in a DQN model. I’m learning how to use DQN to solve problems, so I’m applying it to a deterministic case whose answer I already know.
I’m developing a DQN model that finds the threshold that maximizes a metric of a classification ML model, for example, the threshold that maximizes the F1 score. In this example, my states are values in the range (0, 1), and my two actions are to decrease or increase the current threshold by 0.01 (see the sketch below).
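Here is a minimal sketch of the setup I have in mind; the data and the names `y_true`, `y_prob`, `f1_at`, and `step` are placeholders for my actual classifier outputs:

```python
import numpy as np
from sklearn.metrics import f1_score

# Placeholder data; in my real setup these come from a trained classifier.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)            # ground-truth labels
y_prob = np.clip(rng.normal(y_true, 0.3), 0, 1)   # predicted probabilities

def f1_at(threshold):
    """F1 score obtained by binarizing the probabilities at `threshold`."""
    return f1_score(y_true, (y_prob >= threshold).astype(int))

# State: a threshold in (0, 1). Actions: 0 = decrease, 1 = increase by 0.01.
def step(state, action):
    delta = -0.01 if action == 0 else 0.01
    return float(np.clip(state + delta, 0.01, 0.99))
```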
So, I tried several ways to set the reward policy and ended up with one formulated in terms of the metric I want to maximize: if the F1 score at the next state is greater than the F1 score at the current state, the reward is 1.
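Concretely, the reward looks like this (using `f1_at` from the sketch above; the value returned when F1 does not improve is a placeholder, since varying that case is part of what I experimented with):

```python
def reward(state, next_state):
    """+1 if the transition improved the F1 score.

    What to return otherwise (0, -1, ...) is one of the things I varied.
    """
    return 1.0 if f1_at(next_state) > f1_at(state) else 0.0
```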
My main question is whether this approach to computing rewards is correct, or at least reasonable. I’m worried that I might be violating some principle of DQN by defining the reward in terms of both the next and the current state.