Deep Q Learning: question about back propagation

Question

I'm trying to build a reinforcement-learning neural network for the CartPole-v0 problem from OpenAI Gym. I understand that to find the network's error I must compute the target Q-value from the Bellman equation and subtract it from the Q-value the network output. But doesn't that give me the error for only one of the outputs? For example, if my network outputs two Q-values [A = 0.2, B = 0.8], the chosen action would be B, because it has the greater Q-value. Once I observe the next state, I can use the Bellman equation to compute the target Q-value for action B. How do I find the target value for A, given that we don't know what the next state would have been if action A had been chosen?
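To make the example concrete, here is a minimal sketch of that error computation in plain Python. The function name `td_error`, the next-state Q-values, the reward, and the discount factor are all hypothetical values chosen for illustration. It assumes the common convention of using the network's own prediction as the target for the non-chosen action, which makes that action's error exactly zero.

```python
def td_error(q_current, q_next, action, reward, done, gamma):
    """Per-output error (prediction minus target) for one transition.

    q_current: Q-values predicted for the current state
    q_next:    Q-values predicted for the observed next state
    action:    index of the action that was actually taken
    """
    # Bellman target for the chosen action only
    target = reward if done else reward + gamma * max(q_next)

    # Assumption: non-chosen actions use their own prediction as target,
    # so their error is zero and only the chosen action drives the update.
    targets = list(q_current)
    targets[action] = target
    return [q - t for q, t in zip(q_current, targets)]


# Hypothetical numbers: action B (index 1) was taken from [A = 0.2, B = 0.8]
errors = td_error([0.2, 0.8], [0.5, 1.0], action=1, reward=1.0,
                  done=False, gamma=0.9)
```

With these numbers the target for B is 1.0 + 0.9 * 1.0 = 1.9, so the error for B is 0.8 - 1.9 = -1.1, while the error for A stays at zero.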

Here is my back propagation code:

It learns from random mini-batches of size 32.

delta_target is the error of the chosen action.

delta_1 is the error for the output layer of the neural network (only 2 outputs).

I set the error of the non-chosen action to zero (what should it be set to?).

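import random
import numpy as np

def replay(self, b_size):
    # Sample a random mini-batch of stored transitions
    mini_batch = random.sample(self.memory, b_size)

    for c_state, c_action, c_reward, n_state, n_done in mini_batch:
        # Bellman target: just the reward on terminal transitions
        target = c_reward
        if not n_done:
            target = c_reward + self.gamma * np.amax(self.predict(n_state))
        # Error of the chosen action: predicted Q-value minus target
        delta_target = self.predict(c_state)[c_action] - target
        self.learn(delta_target, c_action)

    # Decay the exploration rate until it reaches its minimum
    if self.epsilon > self.epsilon_min:
        self.epsilon *= self.epsilon_decay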

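def learn(self, d_target, act):
    # Output-layer error: zero everywhere except the chosen action
    delta_1 = np.zeros((self.action_size, 1))
    delta_1[act] = d_target
    # Error passed back to the hidden layer through the output weights
    delta_0 = np.dot(web.weight[1].T, delta_1)

    # Gradient-descent step on the weights
    web.weight[1] -= self.alpha * web.layer[1].T * delta_1
    web.weight[0] -= self.alpha * web.layer[0].T * delta_0

    # Gradient-descent step on the biases
    web.bias[2] -= self.alpha * delta_1
    web.bias[1] -= self.alpha * delta_0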
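For reference, here is a small self-contained sketch of what a masked output error does in a single update step. It is not the `web` object above: the function `output_layer_step`, the array shapes, and all numbers are hypothetical, and it assumes a linear output layer (no activation derivative) with the hidden activations stored as a column vector.

```python
import numpy as np

def output_layer_step(weights, hidden, action, d_target, alpha):
    """One gradient step on a linear output layer with a masked error.

    weights: (n_actions, n_hidden) output weights
    hidden:  (n_hidden, 1) hidden-layer activations
    Returns the updated weights and the error passed back to the hidden layer.
    """
    delta_1 = np.zeros((weights.shape[0], 1))
    delta_1[action] = d_target             # only the chosen action carries error
    delta_0 = weights.T @ delta_1          # back-propagated error, shape (n_hidden, 1)
    new_weights = weights - alpha * (delta_1 @ hidden.T)  # outer-product gradient
    return new_weights, delta_0


# Hypothetical numbers: 2 actions, 3 hidden units, action 1 was taken
weights = np.array([[0.1, 0.2, 0.3],
                    [0.4, 0.5, 0.6]])
hidden = np.array([[1.0], [2.0], [3.0]])
new_weights, delta_0 = output_layer_step(weights, hidden, action=1,
                                         d_target=-1.1, alpha=0.1)
```

In this sketch the weight row of the non-chosen action is left unchanged by the step, while the hidden layer still receives a non-zero error through the chosen action's weights.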
