
I am working on a DQN model trained on the game "CartPole-v1". The program did not report any error in the terminal. However, the evaluation results got worse over time. This is the output data:

episode: 85 score: 18 average score: 20.21 epsilon: 0.66
episode: 86 score: 10 average score: 20.09 epsilon: 0.66
episode: 87 score: 9 average score: 19.97 epsilon: 0.66
episode: 88 score: 14 average score: 19.90 epsilon: 0.65
episode: 89 score: 9 average score: 19.78 epsilon: 0.65
episode: 90 score: 10 average score: 19.67 epsilon: 0.65
episode: 91 score: 14 average score: 19.60 epsilon: 0.64
episode: 92 score: 13 average score: 19.53 epsilon: 0.64
episode: 93 score: 17 average score: 19.51 epsilon: 0.64
episode: 94 score: 10 average score: 19.40 epsilon: 0.63
episode: 95 score: 16 average score: 19.37 epsilon: 0.63
episode: 96 score: 16 average score: 19.33 epsilon: 0.63
episode: 97 score: 10 average score: 19.24 epsilon: 0.62
episode: 98 score: 13 average score: 19.17 epsilon: 0.62
episode: 99 score: 12 average score: 19.10 epsilon: 0.62
episode: 100 score: 11 average score: 19.02 epsilon: 0.61
episode: 101 score: 17 average score: 19.00 epsilon: 0.61
episode: 102 score: 11 average score: 18.92 epsilon: 0.61
episode: 103 score: 9 average score: 18.83 epsilon: 0.61

I'll show my code here. First, I constructed a neural network:

import random
from torch.autograd import Variable
import torch as th
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import gym
from collections import deque

# construct a neural network (prepare for step 1, step 3.2 and step 3.3)
class DQN(nn.Module):
    def __init__(self, s_space, a_space) -> None:

        # initialize the parent class nn.Module
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(s_space, 360)
        self.fc2 = nn.Linear(360, 360)
        self.fc3 = nn.Linear(360, a_space)


    # DNN operation architecture
    def forward(self, input):
        out = self.fc1(input)
        out = F.relu(out)
        out = self.fc2(out)
        out = F.relu(out)
        out = self.fc3(out)
        return out

Instead of creating an agent class, I directly defined, at global scope, the select function, which selects an action according to epsilon, and the backpropagation function that updates the network by gradient descent:

# define the action selection according to epsilon using the neural network (prepare for step 3.2)
def select(net, epsilon, env, state):

    # randomly select an action if not greedy
    if(np.random.rand() <= epsilon):
        action = env.action_space.sample()
        return action
    # select the maximum reward action by NN and the given state if greedy
    else:
        actions = net(Variable(th.Tensor(state))).detach().numpy()
        action = np.argmax(actions[0])
        return action

This is the backpropagation function and the function that decreases epsilon:

# use the loss function to improve the neural network (prepare for step 3.3)
def backprbgt(net, store, batch_size, gamma, learning_rate):
   
    # step1: create loss function and Adam optimizer
    loss_F = nn.MSELoss()
    opt = th.optim.Adam(net.parameters(),lr=learning_rate)

    # step2: extract the sample in memory
    materials = random.sample(store, batch_size)

    # step3: Calculate arguments of loss function:

    for t in materials:

        Q_value = net(Variable(th.Tensor(t[0])))

        # step 3.1: calculate tgt_Q_value greedily (max over next-state Q-values):
        reward = t[3]
        if(t[4] == True):
            tgt = reward
        else:
            tgt = reward + gamma * np.amax(net(Variable(th.Tensor(t[2]))).detach().numpy()[0])
        # print(tgt)
        # tgt_Q_value = Variable(th.Tensor([[float(tgt)]]), requires_grad=True)

        # print("Q_value:",Q_value)
        Q_value[0][t[1]] = tgt
        tgt_Q_value = Variable(th.Tensor(Q_value))
        # print("tgt:",tgt_Q_value)

        # step3.2 Calculate evlt_Q_value
        
        # index = th.tensor([[t[1]]])
        # evlt_Q_value = Q_value.gather(1,index)  # gather tgt into the corresponding action
        evlt_Q_value = net(Variable(th.Tensor(t[0])))
        # print("evlt:",evlt_Q_value)


        # step4: backward and optimization
        loss = loss_F(evlt_Q_value, tgt_Q_value)
        # print(loss)
        opt.zero_grad()
        loss.backward()
        opt.step()

# step5: decrease epsilon for exploitation
def decrease(epsilon, min_epsilon, decrease_rate):
    if(epsilon > min_epsilon):
        epsilon *= decrease_rate

After that, the parameters and the training process look like this:

# training process

# step 1: set parameters and NN
episode = 1500
epsilon = 1.0
min_epsilon = 0.01
dr = 0.995
gamma = 0.9
lr = 0.001
batch_size = 40
memory_store = deque(maxlen=1500)

# step 2: define game category and associated states and actions
env = gym.make("CartPole-v1")
s_space = env.observation_space.shape[0]
a_space = env.action_space.n

net = DQN(s_space, a_space)
score = 0

# step 3: training
for e in range(0, episode):

    # step3.1: at the start of each episode, the current result should be refreshed

    # set initial state matrix
    s = env.reset().reshape(-1, s_space)

    # step3.2: iterate the state and action
    for run in range(500):

        # select action and get the next state according to current state "s"
        a = select(net, epsilon, env, s)
        obs, reward, done, info = env.step(a)

        next_s = obs.reshape(-1,s_space)
        s = next_s

        score += 1

        if(done == True):
            reward = -10.0
            memory_store.append((s,a,next_s,reward,done))
            avs = score / (e+1)
            print("episode:", e+1, "score:", run+1, "avarage score: {:.2f}".format(avs), "epsilon: {:.2}".format(epsilon))
            break

        # save the sample data
        memory_store.append((s, a, next_s, reward, done))

        if(run == 499):
            print("episode:", e+1, "score:", run+1, "avarage score:", avs)

    # step 3.3: whenever the memory store holds more samples than the batch size,
    # run backpropagation to improve the NN
    if(len(memory_store) > batch_size):
        backprbgt(net, memory_store, batch_size, gamma, lr) # here we need a backprbgt function to backward
        if(epsilon > min_epsilon):
            epsilon = epsilon * dr

During the entire training process, no error or exception was raised. However, instead of the score increasing, the model achieved lower scores in later episodes. I think the theory behind the model is correct, but I cannot find where the error is, although I have tried many ways to improve my code, including rechecking the input arguments of the network and modifying the data structures of the two arguments of the loss function. I have pasted my code here and hope to get some help on how to fix it. Thanks!

speedhawk1
  • Huge mistake in step 3.2 -- the line `s = next_s` should be at the end of the episode loop. Because of its current position, `memory_store` keeps the same array for both `state` and `next_s`, so the Bellman equation doesn't work and no training occurs. – draw Apr 17 '22 at 08:12
  • The main metric of a model's success is `avs`, which is based on `score`, but you keep it global across all episodes, so when the model starts to perform well in late iterations you won't see it, because one good episode is a small relative change compared to the many bad episodes. So for now I'd suggest getting rid of the `score` variable and printing only the number of steps per episode, or rewriting the score update to keep information about only the last K episodes, not all of them. – draw Apr 17 '22 at 08:22
  • While you have a `batch_size` variable, in the `backprbgt` function you are actually using a batch of size 1: you update the network weights on every step in the batch. That gives `batch_size` gradient steps, each on a single squared difference, when it should be one gradient step on the mean of the squared differences computed over the whole batch (see the sketch after these comments). – draw Apr 17 '22 at 08:40
  • Thank you for your answer. About the last point you mentioned, how can I fix it? Many thanks! @draw – speedhawk1 Apr 17 '22 at 09:34
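
To make the comments above concrete, here is a minimal sketch (not code from the original post) of the two fixes: the corrected ordering inside the episode loop, and a batched version of `backprbgt` (called `backprbgt_batched` here, a name chosen just for illustration) that takes one gradient step on the mean squared TD error of the whole batch. It assumes the question's imports, the `(state, action, next_state, reward, done)` layout of `memory_store` with states shaped `(1, s_space)`, and that the Adam optimizer is created once outside the function and passed in.

# --- Sketch 1: store the transition before advancing the state ---
# (addresses the first comment; `s` must still be the pre-step state
# when the transition is appended)
a = select(net, epsilon, env, s)
obs, reward, done, info = env.step(a)
next_s = obs.reshape(-1, s_space)
if done:
    reward = -10.0
memory_store.append((s, a, next_s, reward, done))
s = next_s  # advance the state only after the transition has been stored

# --- Sketch 2: one gradient step on the whole batch ---
def backprbgt_batched(net, opt, store, batch_size, gamma):
    batch = random.sample(store, batch_size)

    states      = th.Tensor(np.concatenate([t[0] for t in batch]))   # (B, s_space)
    actions     = th.LongTensor([t[1] for t in batch]).unsqueeze(1)  # (B, 1)
    next_states = th.Tensor(np.concatenate([t[2] for t in batch]))   # (B, s_space)
    rewards     = th.Tensor([t[3] for t in batch])                   # (B,)
    dones       = th.Tensor([float(t[4]) for t in batch])            # (B,)

    # Q(s, a) for the actions that were actually taken
    q_values = net(states).gather(1, actions).squeeze(1)             # (B,)

    # Bellman targets; no gradient flows through the next-state values
    with th.no_grad():
        next_q  = net(next_states).max(1)[0]                         # (B,)
        targets = rewards + gamma * next_q * (1.0 - dones)

    # one optimizer step on the mean of the squared differences
    loss = F.mse_loss(q_values, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()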

1 Answer


Check out the code. For the most part it is the same as the snippet above, but there are some changes:

  • for each step in the replay buffer (called `memory_store` in the code) a namedtuple is used; in the update it is much easier to read `t.reward` than to figure out what every index means in a step `t`

  • the `DQN` class has an `update` method; it is better to keep the optimizer as an attribute of the class than to create it every time `backprbgt` is called (see the sketch after this list)

  • the use of `torch.autograd.Variable` is unnecessary here, so it was removed as well

  • the update in `backprbgt` is done per batch

  • the hidden layer size is decreased from 360 to 32, while the batch size is increased from 40 to 128

  • the network is updated once every 10 episodes, but on 10 batches from the replay buffer

  • the average score is printed every 50 episodes and is based on the last 10 episodes

  • random seeds are added
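
The linked code is not reproduced in the answer, so the following is only a rough sketch (an illustration, not the answerer's exact code) of the structure the bullets describe: a namedtuple for replay-buffer entries, a hidden size of 32, and an `update` method on the `DQN` class with the optimizer kept as an attribute. The names `Transition` and `update`, and the exact tensor handling, are assumptions; the question's imports (`th`, `nn`, `F`, `np`, `deque`, `random`) are reused.

from collections import namedtuple

# Illustrative transition record; field access like t.reward replaces
# index lookups such as t[3]
Transition = namedtuple("Transition", ("state", "action", "next_state", "reward", "done"))

class DQN(nn.Module):
    def __init__(self, s_space, a_space, hidden=32, lr=0.001, gamma=0.9):
        super().__init__()
        self.fc1 = nn.Linear(s_space, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, a_space)
        self.gamma = gamma
        # optimizer created once and kept as an attribute
        self.opt = th.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

    def update(self, batch):
        # batch: list of Transition, with states stored as flat arrays of shape (s_space,)
        states      = th.Tensor(np.stack([t.state for t in batch]))
        actions     = th.LongTensor([t.action for t in batch]).unsqueeze(1)
        next_states = th.Tensor(np.stack([t.next_state for t in batch]))
        rewards     = th.Tensor([t.reward for t in batch])
        dones       = th.Tensor([float(t.done) for t in batch])

        # one gradient step on the mean squared TD error of the whole batch
        q = self(states).gather(1, actions).squeeze(1)
        with th.no_grad():
            target = rewards + self.gamma * self(next_states).max(1)[0] * (1.0 - dones)

        loss = F.mse_loss(q, target)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()

One possible way to wire this into the training loop, following the "once every 10 episodes, on 10 batches" and "batch size 128" bullets (again a sketch; seed values are arbitrary, and `memory_store` is assumed to now hold `Transition` records with flat state arrays):

random.seed(0)
np.random.seed(0)
th.manual_seed(0)

# report the average over the last 10 episodes instead of all episodes
last_scores = deque(maxlen=10)

# inside the loop over episodes e:
if (e + 1) % 10 == 0 and len(memory_store) >= 128:
    for _ in range(10):
        net.update(random.sample(memory_store, 128))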

Also, RL takes a long time to learn anything, so hoping that the score will be close to even 100 points after 100 episodes is somewhat optimistic. For the linked code, averaging over 5 runs gives the following dynamics:

[Plot: number of steps per episode vs. number of training episodes, averaged over 5 runs]

X axis -- number of episodes (yeah, 70 K, but it's like 20 minutes of real time)

Y axis -- number of steps in episode

As can be seen, after 70K episodes the algorithm achieves a reward comparable to the highest possible in this environment (which is 500). By tweaking hyperparameters a faster rate can be achieved, but also remember this is DQN without any modifications.

draw