Deep Q Learning - Cartpole Environment

Question

I am having trouble understanding the CartPole code used as an example for Deep Q-Learning. The DQL agent part of the code is as follows:

import random
from collections import deque

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


class DQLAgent:
    def __init__(self, env):
        # parameter / hyperparameter
        self.env = env  # kept so act() does not depend on a global env
        self.state_size = env.observation_space.shape[0]
        self.action_size = env.action_space.n

        self.gamma = 0.95
        self.learning_rate = 0.001

        self.epsilon = 1  # explore
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01

        self.memory = deque(maxlen=1000)

        self.model = self.build_model()

    def build_model(self):
        # neural network for deep q learning
        model = Sequential()
        model.add(Dense(48, input_dim=self.state_size, activation="tanh"))
        model.add(Dense(self.action_size, activation="linear"))
        # `lr` is deprecated in recent Keras versions; use `learning_rate`
        model.compile(loss="mse", optimizer=Adam(learning_rate=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        # storage
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # acting: explore or exploit
        if random.uniform(0, 1) <= self.epsilon:
            return self.env.action_space.sample()
        else:
            act_values = self.model.predict(state)
            return np.argmax(act_values[0])

    def replay(self, batch_size):
        # training
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            if done:
                target = reward
            else:
                target = reward + self.gamma*np.amax(self.model.predict(next_state)[0])
            # current Q estimates, with only the taken action overwritten
            train_target = self.model.predict(state)
            train_target[0][action] = target
            self.model.fit(state, train_target, verbose=0)

    def adaptiveEGreedy(self):
        # decay epsilon until it reaches epsilon_min
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

In the training section (replay), we compute target and train_target. So why do we then set train_target[0][action] = target?

Every prediction made during learning is inaccurate at first, but thanks to the error calculation and backpropagation, the predictions at the output of the network get closer and closer to the correct values. However, when we set train_target[0][action] = target, the error seems to become 0. In that case, how does the network learn?

gekrone · Accepted Answer · 2021-05-31T22:21:42.433

self.model.predict(state) returns an array of shape (1, 2) containing the estimated Q value for each action (in CartPole the action space is {0, 1}). As you know, the Q value is a measure of the expected discounted sum of future rewards.

By copying that prediction into train_target and setting train_target[0][action] = target (where target is the expected sum of rewards), the code creates a target Q value on which to train the model. By then calling model.fit(state, train_target), it uses the target Q value to train the model to approximate better Q values for each state.

I don't understand why you are saying that the loss becomes 0: the target is set to the current reward plus the discounted estimate of future rewards

target = reward + self.gamma*np.amax(self.model.predict(next_state)[0])

while the network's current prediction for the action that was taken is

self.model.predict(state)[0][action]

The loss between the target and the predicted values is what is used to train the model.

Edit - more detailed explanation

(you can ignore the [0] on the predicted values; it only selects the single row of the batch and is unimportant for the understanding)

The target variable is set to the sum of the current reward and the discounted estimate of future rewards, i.e. the Q value. Note that this variable is called target, but it is not the target of the network; it is the target Q value for the chosen action.

The train_target variable is used as what you call the "dataset". It represents the target of the network.

train_target = self.model.predict(state)
train_target[0][action] = target

You can clearly see that:

train_target[<taken action>] = reward + self.gamma*np.amax(self.model.predict(next_state)[0])
train_target[<any other action>] = <prediction from the model>

The loss (mean squared error) is:

prediction = self.model.predict(state)
loss = (train_target - prediction)^2

For any entry of train_target that is not the taken action, the loss is 0. For the one entry that has been overwritten, the loss is

(target - prediction[action])^2

or

((reward + self.gamma*np.amax(self.model.predict(next_state)[0])) - self.model.predict(state)[0][action])^2

which is clearly different from 0.
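To make this concrete, here is a small worked example of the same arithmetic. It is only a sketch: the arrays q_state and q_next_state are made-up numbers standing in for self.model.predict(state) and self.model.predict(next_state), and the action and reward are chosen arbitrarily.

```python
import numpy as np

gamma = 0.95

# Hypothetical network outputs, chosen only for illustration.
q_state = np.array([[0.50, 0.80]])       # stands in for model.predict(state)
q_next_state = np.array([[0.60, 0.90]])  # stands in for model.predict(next_state)
action, reward = 1, 1.0

# Same arithmetic as the non-terminal branch of replay().
target = reward + gamma * np.amax(q_next_state[0])

train_target = q_state.copy()        # start from the current prediction
train_target[0][action] = target     # overwrite only the taken action

# Per-entry squared error between what we train on and what the model says now.
squared_error = (train_target - q_state) ** 2
print(target)
print(squared_error)
```

The entry for the action that was not taken has zero error, so it contributes no gradient. The entry for the taken action has an error of roughly (1.855 - 0.8)^2, which is what drives the update.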

Note that this agent is not ideal. I would strongly recommend using a separate target model instead of bootstrapping the target Q values from the same network that is being trained. Check out this answer for why.
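As a rough illustration of the target-model idea, the sketch below keeps a lagged copy of the model and bootstraps from that copy. Everything in it is an assumption made for the example: LinearQ is a hypothetical NumPy stand-in for the Keras network, the transitions are random numbers rather than CartPole data, and sync_every is an arbitrary update period.

```python
import copy
import numpy as np

class LinearQ:
    """Hypothetical stand-in for the Keras model: Q(s) = s @ W."""
    def __init__(self, state_size, action_size, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(state_size, action_size))
        self.lr = lr

    def predict(self, state):
        return state @ self.W

    def fit(self, state, train_target):
        # One gradient-descent step on the squared error.
        self.W -= self.lr * state.T @ (self.predict(state) - train_target)

gamma = 0.95
sync_every = 10                    # assumed update period, not from the original code

online = LinearQ(4, 2)
frozen = copy.deepcopy(online)     # target model: a lagged copy of the online model

rng = np.random.default_rng(1)
for step in range(1, 26):
    # Made-up transition; a real agent would sample this from its replay memory.
    state = rng.normal(size=(1, 4))
    next_state = rng.normal(size=(1, 4))
    action, reward, done = int(rng.integers(2)), 1.0, False

    # Bootstrap from the frozen copy so the target does not move after every fit.
    if done:
        target = reward
    else:
        target = reward + gamma * np.amax(frozen.predict(next_state)[0])

    train_target = online.predict(state)
    train_target[0][action] = target
    online.fit(state, train_target)

    if step % sync_every == 0:
        # With Keras this would be target_model.set_weights(model.get_weights()).
        frozen.W = online.W.copy()
```

Between syncs the frozen copy lags behind the online model, so the value used inside the target stays fixed for a while instead of changing with every gradient step.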

I have been thinking about your answer since you wrote it, but the problem still isn't clear in my mind. We already have the trained Q values, that is, **train_target**. I think of **train_target** as our prediction. We also already have **target**, which is like the dataset if we compare it to ordinary deep learning. And the loss is (**target**-**prediction**)^2, that is, (**target**-**train_target**)^2. So I still don't understand the `train_target[0][action] = target` part of the training. @gekrone — jasmin, May 31 '21 at 20:22
I'll try to make myself clearer, check out the edited question and let me know if you understand — gekrone, May 31 '21 at 22:22
Thank you very much! Now I see what I misunderstood before your answer. @gekrone — jasmin, Jun 01 '21 at 08:23
I am sorry, there is another point that confuses me when I think about it again. We say `train_target[0][action] = target`, that is, `self.model.predict(state)[0][action]=target`. Then `loss=(target - self.model.predict(state)[0][action])^2`, which is `loss=(target - target)^2`. This is the point that makes me think the loss is 0. @gekrone — jasmin, Jun 01 '21 at 09:09
When you do `train_target[0][action] = target` you are just changing **train_target**, which is a variable. You are not directly changing your model (which is not possible this way anyway), just a copy of its outputs. So `self.model.predict(state)[0][action]` is not the target; it's the Q value predicted by your **current model** for that particular action. — gekrone, Jun 01 '21 at 09:23
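
The point about the copy can be checked with a short experiment. This is a minimal sketch, assuming a hypothetical NumPy stand-in (TinyLinearQ) in place of the Keras model; the state and the target value 1.855 are arbitrary illustration values.

```python
import numpy as np

class TinyLinearQ:
    """Hypothetical stand-in for the Keras model: Q(s) = s @ W."""
    def __init__(self, state_size, action_size, lr=1.0):
        rng = np.random.default_rng(0)
        self.W = rng.normal(scale=0.1, size=(state_size, action_size))
        self.lr = lr

    def predict(self, state):
        return state @ self.W        # returns a fresh array on every call

    def fit(self, state, train_target):
        # One gradient-descent step on the squared error.
        error = self.predict(state) - train_target
        self.W -= self.lr * state.T @ error

model = TinyLinearQ(4, 2)
state = np.array([[0.1, -0.2, 0.05, 0.3]])
action, target = 1, 1.855

before = model.predict(state)
train_target = model.predict(state)
train_target[0][action] = target     # edits the returned array, not the model

unchanged = model.predict(state)     # the model still gives its old estimate
model.fit(state, train_target)
after = model.predict(state)         # only now has the estimate moved

print(before, unchanged, after)
```

Overwriting the entry in train_target leaves the model's prediction untouched; the prediction for the taken action only moves toward target after fit is called, and a single step does not make it equal to target.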
