
I'm wondering how discounting rewards in reinforcement learning actually works. I believe the idea is that rewards later in an episode get weighted more heavily than early rewards. That makes perfect sense to me, but I'm having a hard time understanding how it actually works in the examples I've seen.

I'm assuming the code below is a standard way to do reinforcement learning. I'm interpreting it as follows: go through each action and train the model that the predicted action was good or bad.

What this appears to be doing is uniformly multiplying all my predictions by whatever gamma is, adding the reward, and using that to train the model.

Seeing as the reward is always updated each step, I'm having a hard time understanding how this achieves the goal of making early actions in the episode less encouraged/discouraged than later ones. Shouldn't the rewards get added together from step to step and then multiplied by gamma to achieve this?

    def replay(self, batch_size):
        # Sample a random minibatch of stored transitions (experience replay).
        minibatch = random.sample(self.memory, batch_size)

        for state, action, reward, next_state, done in minibatch:
            # Terminal transition: the target is just the immediate reward.
            target = reward
            if not done:
                # Otherwise: immediate reward plus the discounted value of the
                # best action the current network predicts for the next state.
                target = reward + self.gamma * np.amax(self.model.predict(next_state))
            # Overwrite only the taken action's output; the other actions keep
            # the network's own predictions so they contribute no error.
            target_f = self.model.predict(state)
            target_f[0][action] = target

            self.model.fit(state, target_f, epochs=1, verbose=0)

        # Decay the exploration rate after each replay pass.
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

1 Answer


You seem to have a few misconceptions about what problem the code is solving. I'll try to clear up the one about discounted rewards.

Let's first assume that we do not discount the rewards. The value of taking an action in a given state is defined as the sum of rewards that the agent is expected to collect when it takes this action and then follows a fixed policy.

We could use this definition and learn the value function directly. But one problem is that if the agent lives forever, it can collect infinite reward. The agent is also under no pressure to act promptly: it will happily go through a million bad states if that helps it slowly reach a good state where it can stay forever. And such action-values are harder to learn (and to stabilize) when we have to look ahead for millions of time steps.

Discounted rewards solve this. The agent's goal is modified to maximize not the plain sum of rewards, but the immediate reward, plus 0.9 times the next reward, plus 0.9*0.9 times the one after that, and so on, so a reward a million time steps away is, for all practical purposes, irrelevant to the agent's current decision. This has nothing to do with the beginning or end of an episode: the discounting always starts from the current state.
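To make the geometric weighting concrete, here is a minimal sketch (toy numbers, not from the original code), assuming a hypothetical agent that receives a reward of 1.0 at every step forever:

    gamma = 0.9

    # Undiscounted: the sum of rewards grows without bound as the horizon grows.
    print(sum(1.0 for _ in range(1_000_000)))         # 1000000.0

    # Discounted: 1 + 0.9 + 0.9**2 + ... converges to 1 / (1 - gamma) = 10.
    print(sum(gamma ** t for t in range(1_000_000)))   # ~10.0

Because gamma is applied once per step into the future, rewards far ahead are damped geometrically no matter where in the episode the agent currently is.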

This line you're looking at:

    target = reward + self.gamma * np.amax(self.model.predict(next_state))

computes a better estimate of the action-value. This is the standard textbook formula (see e.g. "Reinforcement Learning: An Introduction" by Sutton and Barto). It uses the predictor itself (which is still being trained) to estimate the value (the sum of discounted rewards) obtainable from the next state onward, and discounts that estimate by one time step with gamma.
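For concreteness, here is a toy sketch of what that target works out to with made-up numbers (the next-state Q-value array is hypothetical, not the output of a real network):

    import numpy as np

    gamma = 0.95
    reward = 1.0                        # immediate reward for this transition
    q_next = np.array([0.3, 2.0, 1.1])  # hypothetical Q-value estimates for next_state

    # Bootstrapped one-step target: immediate reward plus gamma times the best
    # value the current network predicts from the next state. Because gamma is
    # applied once per time step, a reward k steps ahead effectively gets
    # weighted by gamma**k after repeated updates.
    target = reward + gamma * np.amax(q_next)
    print(target)  # 1.0 + 0.95 * 2.0 = 2.9

The geometric weighting over a whole episode emerges from repeating this one-step backup many times, not from multiplying accumulated rewards by gamma at the end of the episode.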

maxy
  • Hi. Thanks for the explanation. One thing I still want to understand: when coding policy gradients, why do we always reverse the reward list and then calculate the discounted sum of rewards? Can you explain that? – Sarvagya Gupta Apr 04 '20 at 08:44
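Regarding the reversed reward list mentioned in the comment: policy-gradient code usually computes the per-step discounted return with the backward recursion G_t = r_t + gamma * G_{t+1}, and iterating the rewards in reverse turns this into a single pass. A minimal sketch (the reward list here is hypothetical):

    gamma = 0.9
    rewards = [1.0, 0.0, 2.0]   # hypothetical episode rewards r_0, r_1, r_2

    returns = []
    running = 0.0
    for r in reversed(rewards):
        # Each step reuses the running total: G_t = r_t + gamma * G_{t+1}
        running = r + gamma * running
        returns.append(running)
    returns.reverse()           # back to time order: [G_0, G_1, G_2]
    print(returns)              # approximately [2.62, 1.8, 2.0]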