I'm wondering how discounting rewards in reinforcement learning actually works. My understanding is that rewards later in an episode get weighted more heavily than early rewards. That makes perfect sense to me. What I'm having a hard time with is how this is actually implemented in the examples I've seen.
I'm assuming the code below is a fairly standard way to do reinforcement learning. I'm interpreting it as follows: go through each sampled action and train the model on whether taking that action was good or bad.
What this appears to be doing is taking the model's best prediction for the next state, multiplying it by whatever gamma is, adding the immediate reward, and using that value to train the model.
Since the target is rebuilt from the immediate reward at every step, I'm having a hard time seeing how this achieves the goal of making early actions in the episode less encouraged/discouraged than later ones. Shouldn't the rewards be added together from step to step and multiplied by gamma along the way, roughly like the sketch below?
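Here is a minimal sketch of what I expected to happen. This is just my own illustration of the question, not code from the example; the rewards list, the function name, and the gamma value are made up:

def accumulated_rewards(rewards, gamma=0.95):
    # Keep a running total of rewards, discounting the total by gamma at each
    # step, so later steps in the episode carry more accumulated reward.
    totals = []
    running = 0.0
    for r in rewards:
        running = gamma * running + r  # discount what has accumulated so far, then add this step's reward
        totals.append(running)
    return totals

# e.g. accumulated_rewards([1.0, 1.0, 1.0]) -> [1.0, 1.95, 2.8525]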
def replay(self, batch_size):
    # Sample a random minibatch of stored transitions
    minibatch = random.sample(self.memory, batch_size)
    for state, action, reward, next_state, done in minibatch:
        # Terminal transitions use the immediate reward as the target
        target = reward
        if not done:
            # Otherwise bootstrap: immediate reward plus the discounted
            # best predicted Q-value of the next state
            target = reward + self.gamma * np.amax(self.model.predict(next_state))
        # Overwrite only the Q-value of the action that was actually taken
        target_f = self.model.predict(state)
        target_f[0][action] = target
        self.model.fit(state, target_f, epochs=1, verbose=0)
    # Decay the exploration rate after each replay
    if self.epsilon > self.epsilon_min:
        self.epsilon *= self.epsilon_decay
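For contrast, here is a tiny worked example of what I understand the code above to be doing for a single transition; the reward, gamma, and predicted Q-values are made up:

# Hypothetical single transition, just to make my reading of the code concrete
reward = 1.0
gamma = 0.95
next_q = [0.2, 0.6]                     # stand-in for self.model.predict(next_state)
target = reward + gamma * max(next_q)   # 1.0 + 0.95 * 0.6 = 1.57
# The model is then fit so its Q-value for (state, action) moves toward 1.57,
# and the same one-step computation is repeated for every sampled transition.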