
I'm trying to build a deep Q-network (DQN) to play Snake. I've run into an issue where the agent doesn't learn: by the end of the training cycle its behaviour is to repeatedly kill itself. After a bit of debugging, I figured out that the Q-values the network predicts are the same every time. The action space is [up, right, down, left] and the network always predicts [0, 0, 1, 0]. The training loss does go down over time, but it doesn't seem to make a difference. Here's the training code:

def train(self):
    tf.logging.set_verbosity(tf.logging.ERROR)
    self.build_model()
    for episode in range(self.max_episodes):
        self.current_episode = episode
        env = SnakeEnv(self.screen)
        episode_reward = 0
        for timestep in range(self.max_steps):
            env.render(self.screen)
            state = self.screenshot()
            #state = env.get_state()
            action = None
            epsilon = self.current_eps
            if epsilon > random.random():
                action = np.random.choice(env.action_space) #explore
            else:
                values = self.policy_model.predict(state) #exploit
                action = np.argmax(values)
            experience = env.step(action)
            if(experience['done'] == True):
                episode_reward += experience['reward']
                break
            episode_reward += experience['reward']
            self.push_memory(Experience(experience['state'], experience['action'], experience['reward'], experience['next_state']))
            self.decay_epsilon(episode)
            if self.can_sample_memory():
                memory_sample = self.sample_memory()
                X = []
                Y = []
                for memory in memory_sample:
                    memstate = memory.state
                    action = memory.action
                    next_state = memory.next_state
                    reward = memory.reward
                    max_q = reward + (self.discount_rate * self.replay_model.predict(next_state)) #bellman equation
                    X.append(memstate)
                    Y.append(max_q)
                X = np.array(X)
                X = X.reshape([-1, 600, 600, 2])
                Y = np.array(Y)
                Y = Y.reshape([self.batch_size, 4])
                self.policy_model.fit(X, Y)
        food_eaten = experience["food_eaten"]
        print("Episode: ", episode, " Total Reward: ", episode_reward, " Food Eaten: ", food_eaten)
        if episode % self.target_update == 0:
            self.replay_model.set_weights(self.policy_model.get_weights())
    self.policy_model.save_weights('weights.hdf5')
    pygame.quit()

Here's the network architecture:

    self.policy_model = Sequential()
    self.policy_model.add(Conv2D(8, (5, 5), padding = 'same', activation = 'relu', data_format = "channels_last", input_shape = (600, 600, 2)))
    self.policy_model.add(Conv2D(16, (5, 5), padding="same", activation="relu"))
    self.policy_model.add(Conv2D(32, (5, 5), padding="same", activation="relu"))
    self.policy_model.add(Flatten())
    self.policy_model.add(Dense(16, activation = "relu"))
    self.policy_model.add(Dense(5, activation = "softmax"))
    rms = keras.optimizers.RMSprop(lr = self.learning_rate) 
    self.policy_model.compile(optimizer = rms, loss = 'mean_squared_error')

Here are the hyperparameters:

learning_rate = 1e-4
discount_rate = 0.99
eps_start = 1
eps_end = .01
eps_decay = 1e-5
memory_size = 100000
batch_size = 2
max_episodes = 1000
max_steps = 100000
target_update = 100

I've let it train for the full 1000 episodes and it's pretty bad at the end. Am I doing something wrong with the training algorithm?

EDIT: Forgot to mention that the agent receives a reward of 0.5 for moving towards the food, 1 for eating the food, and -1 for dying.
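
For clarity, the reward logic boils down to something like this (a simplified sketch; the flag names are made up, and the value returned for moving away from the food is just a placeholder here):

    def compute_reward(died, ate_food, moved_towards_food):
        # -1 for dying, 1 for eating the food, 0.5 for moving towards it
        if died:
            return -1.0
        if ate_food:
            return 1.0
        if moved_towards_food:
            return 0.5
        return 0.0  # placeholder: the moving-away case isn't covered above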

EDIT 2: Just read that some DQNs use a stack of 4 consecutive frames as a single sample. Would this be necessary to implement for my environment, considering how simple the movements are?
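
My understanding of that idea is roughly the following (just a sketch, not code I'm running; it assumes single-channel frames stacked along the channel axis):

    from collections import deque
    import numpy as np

    class FrameStack:
        """Keep the last k frames and stack them along the channel axis."""
        def __init__(self, k=4):
            self.k = k
            self.frames = deque(maxlen=k)

        def reset(self, first_frame):
            # At the start of an episode, fill the buffer with copies of the first frame
            for _ in range(self.k):
                self.frames.append(first_frame)
            return self._stacked()

        def step(self, frame):
            self.frames.append(frame)
            return self._stacked()

        def _stacked(self):
            # e.g. four (600, 600) frames -> one (600, 600, 4) network input
            return np.stack(self.frames, axis=-1)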

achandra03

2 Answers


Reinforcement learning algorithms need a very low optimizer learning rate (e.g. 1e-4 or below) so they don't learn too fast and overfit on a subspace of the environment (which looks like your problem). Here you seem to be using the default learning rate of your optimizer (RMSprop, which is 0.001 by default).

Anyway, this could be one possible reason :)
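
If you want to double-check which learning rate is actually in effect after compiling, something like this works in Keras 2.x (just a sketch, reusing your policy_model):

    from keras import backend as K

    # Print the learning rate the compiled optimizer is actually using
    print(K.get_value(self.policy_model.optimizer.lr))

    # Force an explicitly low rate if it turns out to be the default
    K.set_value(self.policy_model.optimizer.lr, 1e-4)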

Dany Yatim

Pay attention to the epsilon decay. It sets the exploration/exploitation trade-off over time. If your epsilon decay is too large, the agent will start exploiting a very small (under-explored) region of the state-action space. In my experience, at least, early convergence to a bad policy was most often caused by too large an epsilon decay.
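
As an illustration, a common schedule is an exponential decay from eps_start down to eps_end (a sketch reusing the hyperparameter names from your question, with the decay applied per step):

    import numpy as np

    def epsilon_at(step, eps_start=1.0, eps_end=0.01, eps_decay=1e-5):
        """Exponentially anneal epsilon from eps_start towards eps_end."""
        return eps_end + (eps_start - eps_end) * np.exp(-eps_decay * step)

    # epsilon stays near eps_start while eps_decay * step << 1 and only
    # approaches eps_end once eps_decay * step reaches a few multiples of 1,
    # so eps_decay directly controls how long the agent keeps exploring.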

Guinther Kovalski