
I'm new to deep reinforcement learning and the DQN model. I used OpenAI Gym to reproduce two experiments, CartPole-v0 and MountainCar-v0.

I referred to code from GitHub, CartPole-v0: https://gist.github.com/floodsung/3b9d893f1e0788f8fad0e6b49cde70f1 and MountainCar-v0: https://gist.github.com/floodsung/0c64d10cab5298c63cd0fc004a94ba1f.

Both models run successfully and achieve the expected reward over the test episodes. But the per-time-step reward of the two models is different.

For CartPole-v0, the reward at each time step is either +1 or 0. Each episode has 300 time steps, and the agent tries to collect as much total reward as it can. The environment source code is here: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py

But in MountainCar-v0, the reward is always -1 for every action, so the agent tries to end up with a total reward that is as little negative as possible. This is also explained here: How does DQN work in an environment where reward is always -1.
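For illustration, here is a minimal sketch (assuming the classic 4-tuple gym step API used in the linked gists) that prints the per-step reward each environment returns:

```python
import gym

# Minimal sketch (assumes the classic 4-tuple gym step API used in the linked gists):
# take one random action in each environment and print the reward it returns.
for env_name in ["CartPole-v0", "MountainCar-v0"]:
    env = gym.make(env_name)
    env.reset()
    _, reward, done, _ = env.step(env.action_space.sample())
    print(env_name, "per-step reward:", reward)  # +1.0 for CartPole, -1.0 for MountainCar
    env.close()
```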

So I am confused about how to determine the reward for actions or states. It seems that both a positive reward and a negative reward can make sense within a limited number of time steps. What is the principle for choosing which one to use? And I see that sometimes the reward can be a float value between them.

And how can one avoid the case where the agent "suicides" instead of trying to reach the target, because of the "live penalty" (a penalty the agent receives at each step, intended to speed up the exploitation phase over exploration)? See: https://datascience.stackexchange.com/questions/43592/rl-weighthing-negative-rewards

Thanks in advance!

赵天阳

1 Answer


There are two points to consider. First, in DQN the agent tries to maximize the approximation of the Q-value given by:

Q(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a} Q(s_{t+1}, a)

So, whatever the rewards are, DQN learns the policy that maximizes the long-term reward, which is Q(s, a). In both of your examples, DQN selects the actions that achieve the highest return: in CartPole-v0 that is 200, and in MountainCar-v0 it is a total reward as close to zero as possible.
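As a minimal sketch (NumPy, with hypothetical names, not the exact code from the gists), the action choice is the same argmax regardless of the reward's sign:

```python
import numpy as np

# Minimal sketch (hypothetical names): epsilon-greedy action selection in DQN.
# Whether rewards are +1 per step (CartPole) or -1 per step (MountainCar),
# the agent simply picks the action with the largest estimated Q-value.
def select_action(q_values, epsilon, n_actions, rng=np.random.default_rng()):
    """Epsilon-greedy over the Q-network's output for one state."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))     # explore
    return int(np.argmax(q_values))             # exploit: maximize long-term reward

# Example: Q-estimates for MountainCar's 3 actions are all negative,
# but argmax still prefers the least-negative one (-95 over -120 and -110).
print(select_action(np.array([-120.0, -95.0, -110.0]), epsilon=0.0, n_actions=3))  # -> 1
```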

The second point is that DQN uses a target network to obtain the target value used to train the Q-network. With the target network, the target value is:

\text{target} = r(s_t, a_t) + (1 - \text{done}) \cdot \gamma \max_{a} Q(s_{t+1}, a)

in which the target value is equal to r(s_t, a_t) if done == 1. In other words, DQN has access to the knowledge of being in a terminal state and uses it to learn a smart policy.
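A minimal sketch of this target computation (NumPy, hypothetical names; q_next would come from the target network):

```python
import numpy as np

# Minimal sketch (hypothetical names) of the target used to train the Q-network.
# q_next are the target network's Q-values for s_{t+1}; dones is 1.0 at terminal
# transitions and 0.0 otherwise.
def dqn_targets(rewards, q_next, dones, gamma=0.99):
    # target = r + (1 - done) * gamma * max_a Q_target(s', a)
    return rewards + (1.0 - dones) * gamma * q_next.max(axis=1)

# Example batch: the second transition is terminal, so its target is just the reward.
rewards = np.array([-1.0, -1.0])
q_next  = np.array([[-50.0, -45.0, -48.0],
                    [-10.0,  -8.0,  -9.0]])
dones   = np.array([0.0, 1.0])
print(dqn_targets(rewards, q_next, dones))  # [-1 + 0.99*(-45), -1] = [-45.55, -1.0]
```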

Afshin Oroojlooy