7

I tried to code a Deep Q Network to play Atari games using TensorFlow and OpenAI Gym. Here's my code:

import tensorflow as tf
import gym
import numpy as np
import os

env_name = 'Breakout-v0'
env = gym.make(env_name)
num_episodes = 100
input_data = tf.placeholder(tf.float32,(None,)+env.observation_space.shape)
output_labels = tf.placeholder(tf.float32,(None,env.action_space.n))

def convnet(data):
    layer1 = tf.layers.conv2d(data,32,5,activation=tf.nn.relu)
    layer1_dropout = tf.nn.dropout(layer1,0.8)
    layer2 = tf.layers.conv2d(layer1_dropout,64,5,activation=tf.nn.relu)
    layer2_dropout = tf.nn.dropout(layer2,0.8)
    layer3 = tf.layers.conv2d(layer2_dropout,128,5,activation=tf.nn.relu)
    layer3_dropout = tf.nn.dropout(layer3,0.8)
    layer4 = tf.layers.dense(layer3_dropout,units=128,activation=tf.nn.softmax,kernel_initializer=tf.zeros_initializer)
    layer5 = tf.layers.flatten(layer4)
    layer5_dropout = tf.nn.dropout(layer5,0.8)
    layer6 = tf.layers.dense(layer5_dropout,units=env.action_space.n,activation=tf.nn.softmax,kernel_initializer=tf.zeros_initializer)
    return layer6

logits = convnet(input_data)
loss = tf.losses.sigmoid_cross_entropy(output_labels,logits)
train = tf.train.GradientDescentOptimizer(0.001).minimize(loss)
saver = tf.train.Saver()
init = tf.global_variables_initializer()
discount_factor = 0.5

with tf.Session() as sess:
    sess.run(init)
    for episode in range(num_episodes):
        x = []
        y = []
        state = env.reset()
        feed = {input_data:np.array([state])}
        print('episode:', episode+1)
        while True:
            x.append(state)
            if (episode+1)/num_episodes > np.random.uniform():
                Q = sess.run(logits,feed_dict=feed)[0]
                action = np.argmax(Q)
            else:
                action = env.action_space.sample()
            state,reward,done,info = env.step(action)
            Q = sess.run(logits,feed_dict=feed)[0]
            new_Q = np.zeros(Q.shape)
            new_Q[action] = reward+np.amax(Q)*discount_factor
            y.append(new_Q)
            if done:
                break

        for sample in range(len(x)):
            _,l = sess.run([train,loss],feed_dict={input_data:[x[sample]],output_labels:[y[sample]]})
            print('training loss on sample '+str(sample+1)+': '+str(l))
    saver.save(sess,os.getcwd()+'/'+env_name+'-DQN.ckpt')

The problem is that:

  1. The loss isn't decreasing during training and always stays somewhere around 0.7 or 0.8.
  2. When I test the network on the Breakout environment, even after training it for 1000 episodes, the actions still seem kind of random and it rarely hits the ball.

I have already tried using different loss functions (softmax cross-entropy and mean squared error), using another optimizer (Adam), and increasing the learning rate, but nothing changed.

Can someone tell me how to fix this?

Kay Jersch

2 Answers

12

Here are some things that stand out that you could look into (in cases like this it's always difficult to tell for sure, without actually trying, which issue(s) matter most):

  • 100 episodes does not seem like a lot. In the image below, you see learning curves of some variants of Double DQN (slightly more advanced than DQN) on Breakout (source). Training time on the x-axis is measured in millions of frames there, not in episodes. I don't know exactly where 100 episodes would be on that x-axis, but I don't think it would be far in. It may simply not be reasonable to expect any kind of decent performance yet after 100 episodes.

OpenAI Baselines DQN Learning Curves Breakout
(source: openai.com)

  • It looks like you're using dropout in your networks. I'd recommend getting rid of the dropout. I don't know 100% for sure that it's bad to use dropout in Deep Reinforcement Learning, but 1) it's certainly not common, and 2) intuitively it doesn't seem necessary. Dropout is used to combat overfitting in supervised learning, but overfitting is not really much of a risk in Reinforcement Learning (at least, not if you're just trying to train for a single game at a time like you are here).

  • discount_factor = 0.5 seems extremely low; this will make it impossible to propagate long-term rewards back through more than a handful of actions. Something along the lines of discount_factor = 0.99 would be much more common.

  • The line if (episode+1)/num_episodes > np.random.uniform(): essentially decays epsilon from 1.0 - 1/num_episodes in the first episode to 1.0 - num_episodes/num_episodes = 0.0 in the last episode. With your current num_episodes = 100, that means it decays from 0.99 to 0.0 over 100 episodes, which seems way too quick to me. For reference, in the original DQN paper epsilon is decayed slowly and linearly from 1.0 to 0.1 over 1 million frames, and kept fixed at 0.1 forever after (the sketch after this list includes a schedule along those lines).

  • You're not using Experience Replay, and not using a separate Target network, as described in the original DQN paper. All of the points above are significantly easier to look into and fix, so I'd recommend starting with those. That might already be enough to start seeing some better-than-random performance after learning, but it will likely still perform worse than it would with these two additions; a minimal sketch of both, together with the slower epsilon schedule, is shown right below.
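
To make those last points a bit more concrete, here is a minimal sketch of a linear epsilon schedule, a simple replay buffer, and the 1-step TD target computation. The names (linear_epsilon, ReplayBuffer, td_targets) and the constants are just illustrative choices, not something taken from your code or from a specific library:

import random
from collections import deque

import numpy as np

REPLAY_CAPACITY = 100000        # replay buffers of 100K-1M transitions are common
BATCH_SIZE = 32
EPSILON_START = 1.0
EPSILON_END = 0.1
EPSILON_DECAY_FRAMES = 1000000  # decay linearly over ~1M frames, then keep epsilon fixed

def linear_epsilon(frame):
    # Linearly decay epsilon from EPSILON_START to EPSILON_END, then keep it constant
    fraction = min(frame / float(EPSILON_DECAY_FRAMES), 1.0)
    return EPSILON_START + fraction * (EPSILON_END - EPSILON_START)

class ReplayBuffer:
    # Fixed-size FIFO store of (state, action, reward, next_state, done) transitions
    def __init__(self, capacity=REPLAY_CAPACITY):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=BATCH_SIZE):
        # Uniformly sample a minibatch of stored transitions
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

def td_targets(rewards, dones, next_q_values, discount_factor=0.99):
    # 1-step targets r + gamma * max_a' Q_target(s', a'), with no bootstrapping on terminal states
    return rewards + discount_factor * (1.0 - dones.astype(np.float32)) * next_q_values.max(axis=1)

In the training loop you would then pick a random action whenever np.random.uniform() < linear_epsilon(total_frames) (with total_frames being a running frame counter), push every transition into the buffer, and, once the buffer is reasonably full, train on sampled minibatches rather than on the episode you just played. The next_q_values would come from the separate target network; in the TF1 style of your code, that could be a second copy of convnet built under its own tf.variable_scope, with assign ops that copy the online weights into it every few thousand steps. Again, this is only a sketch of the general idea, not a drop-in replacement for your code.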

Dennis Soemers
  • Sorry for not having replied in so long. I increased the discount factor to 0.99, removed the dropout layers, added experience replay with a sample size of 64 and let that run for 5000 episodes (1.2 million frames), but its performance is still indistinguishable from random play. Any idea what I could do? – Kay Jersch May 07 '18 at 17:31
  • If I look at those plots in my answer, it seems like all of the algorithms in that plot (which are all slightly more advanced than vanilla DQN) only start increasing above an average episode reward of `0` at about 10% of the first "block". That first "block" in the figure is for 50 million frames seen by the agent, so the 10% point would be at roughly 5 million frames. Based on that, it does seem like you really may have to go over 5 million frames instead of 1.2 million before you start seeing anything better than random – Dennis Soemers May 07 '18 at 18:48
  • Apart from that, if with sample size of 64 for experience replay you mean the size of the experience replay buffer, that seems rather low. I believe values like... 100K or 1 million are more common, not 100% sure off the top of my head, see the DQN paper. Did you also already look into the third point in my answer (the decaying of `epsilon`)? – Dennis Soemers May 07 '18 at 18:49
  • I thought that with the decaying epsilon you meant the step size that resulted from training the network for just 100 episodes. Could you explain it to me again, please? – Kay Jersch May 08 '18 at 18:09
  • @KayJersch Ah yeah, you're right: in your code, by having changed the number of episodes you're training for, you'll also have increased the `epsilon`. You may still want to make sure it doesn't drop below a certain value (like `0.1`) though. For example, by changing that if-statement to something like: `if max(0.1, (episode+1)/num_episodes) > np.random.uniform():` – Dennis Soemers May 08 '18 at 18:30
1

First, let us describe the phenomenon in detail. The error function of the neural network can take a value between 1.0 (maximum error) and 0.0 (the goal). The idea behind a learning algorithm is to bring the error function down to zero, which would mean that the agent plays the game perfectly. At the beginning the learning works well and the error value decreases, but then the curve flattens out at a certain level. That means the CPU is processing huge amounts of data and consuming energy, but the error value is not decreasing anymore.

The good news is that it has nothing to do with your source code. Your implementation of the Deep Q network is great; I would even assume that your source code looks better than that of the average programmer. The problem has to do with the difficulty of the environment in OpenAI Gym. On easy games like "bring the player to a goal position" the network learns well, while on difficult problems like "Montezuma's Revenge" the constant error described above occurs. Overcoming the problem is not as easy as it looks. It is not a matter of fine-tuning a neural network, but of inventing a new way of handling complex games. In the literature, strategies like hierarchical problem solving, natural language grounding and domain-specific ontologies are used to overcome the problem.

Manuel Rodriguez