
I'm designing a reward function for a DQN model, the trickiest part of deep reinforcement learning. I have looked at several examples and noticed that the reward is usually set within [-1, 1]. If the negative reward is triggered less often, i.e. it is more "sparse" than the positive reward, the positive reward may be set lower than 1.

I wish to know why I should always try to set the reward within this range (sometimes it can be [0, 1], other times [-1, 0], or simply -1). What is the theory or principle behind this range?

I went through this answer; it mentioned that setting 500 as the positive reward and -1 as the negative reward will destroy the network. But how would it destroy the model?

I can vaguely understand that this is related to gradient descent, and that it is actually the gap between rewards that matters, not the sign or absolute value. But I'm still missing a clear explanation of how it can destroy the network, and why the reward should fall within such a range.

Besides, when should I use a reward like [0, 1], and when should I use only negative rewards? Within a given number of timesteps, both methods seem able to push the agent to find the highest total reward. Only in situations where I want the agent to reach the terminal state as soon as possible does a negative reward seem more appropriate than a positive one.

Is there a criterion to measure whether the reward design is reasonable? For example, summing the Q values of good and bad actions: if the design is symmetrical, the final Q should be around zero, which would indicate convergence?

赵天阳

2 Answers


I wish to know why I should always try to set the reward within this range (sometimes it can be [0, 1], other times [-1, 0], or simply -1)?

Essentially it's the same whether you define your reward function in the [0, 1] or the [-1, 0] range. It will just result in your action values being positive or negative, but it won't affect the convergence of your neural network.
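To see this concretely, here is a minimal sketch (a hypothetical fixed-horizon setup with made-up per-step rewards, not from the question) showing that shifting every reward by a constant shifts the Q values but leaves the gap between actions, and hence the greedy policy, unchanged:

```python
# Minimal sketch, assuming a fixed horizon and a constant per-step
# reward for each action (hypothetical numbers).
gamma = 0.9
horizon = 5

def q_value(reward_per_step):
    # Discounted return for repeating the same per-step reward.
    return sum(reward_per_step * gamma**t for t in range(horizon))

# Rewards in [0, 1] vs. the same rewards shifted into [-1, 0]:
gap_positive = q_value(1.0) - q_value(0.5)
gap_shifted = q_value(0.0) - q_value(-0.5)
print(gap_positive, gap_shifted)  # both ~2.0476: same gap, same argmax
```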

I went through this answer; it mentioned that setting 500 as the positive reward and -1 as the negative reward will destroy the network. But how would it destroy the model?

I wouldn't really agree with that answer. Such a reward function wouldn't "destroy" the model; however, it is incapable of providing balanced positive and negative rewards for the agent's actions. It gives the agent an incentive not to crash, but doesn't encourage it to cut off opponents.
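That said, a +500 reward next to a -1 reward does make the TD errors, and hence the gradients, for the rare transitions hundreds of times larger than for the common ones, which can destabilize training. The original DQN paper handled exactly this by clipping all rewards into [-1, 1]; a minimal sketch:

```python
import numpy as np

def clip_reward(r):
    # Clip rewards into [-1, 1], as in the original DQN paper, so a
    # single huge reward cannot dominate the gradient updates.
    return float(np.clip(r, -1.0, 1.0))

print(clip_reward(500))  # 1.0 -- the +500 and -1 now differ by at most 2
print(clip_reward(-1))   # -1.0
```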

Besides, when should I use a reward like [0, 1], and when should I use only negative rewards?

As mentioned previously, it doesn't matter whether you use positive or negative rewards. What matters is the relative size of your rewards. For example, as you said, if you want the agent to reach the terminal state as soon as possible, introducing negative rewards will only work if no positive reward is present during the episode. If the agent could pick up a positive reward midway through the episode, it would not be incentivized to end the episode as soon as possible. Therefore, it's the relative size that matters.
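A quick numerical sketch of this point (hypothetical rewards, undiscounted for simplicity): with a -1 per-step cost a shorter episode always scores higher, but a single positive pickup midway can flip that ordering:

```python
def episode_return(rewards, gamma=1.0):
    # Discounted sum of rewards over one episode.
    return sum(r * gamma**t for t, r in enumerate(rewards))

print(episode_return([-1, -1, -1]))          # -3: the short path wins...
print(episode_return([-1, -1, -1, -1, -1]))  # -5: ...over the longer path,
print(episode_return([-1, -1, 5, -1, -1]))   # +1: unless a midway bonus flips it
```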

KaiJun

What's the principle for designing the reward function of a DQN?

As you said, this is the tricky part of RL. In my humble opinion, the reward is "just" the way to lead your system to the (state, action) pairs that you value most. So, if you consider that one (state, action) pair is 500x more valuable than another, why not?

About the range of values: suppose you know all the rewards that can be assigned; then you know the range of values, and you could easily normalize it, say to [0, 1]. So the range itself doesn't mean much, but the values you assign say a lot.
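A minimal min-max normalization sketch, assuming (as above) that the smallest and largest possible rewards are known in advance:

```python
def normalize(r, r_min, r_max):
    # Map a reward from [r_min, r_max] into [0, 1].
    return (r - r_min) / (r_max - r_min)

print(normalize(500, -1.0, 500.0))  # 1.0
print(normalize(-1, -1.0, 500.0))   # 0.0
```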

About negative reward values: in general, I find them in problems where the objective is to minimize costs. For instance, suppose you have a robot whose goal is to collect trash in a room, and from time to time it has to recharge itself to continue the task. You could have negative rewards for battery consumption, and your goal is to minimize them. On the other hand, in many games the goal is to score more and more points, so it can be natural to assign positive values.
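As a hypothetical illustration of the cost-minimization case (the weights below are illustrative assumptions, not from any real system):

```python
def reward(trash_collected, battery_used):
    # Costs enter as negative reward, so maximizing the return
    # simultaneously minimizes battery consumption (weights are
    # illustrative assumptions).
    return 1.0 * trash_collected - 0.1 * battery_used

print(reward(trash_collected=2, battery_used=5))  # 1.5
```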

HenDoNR