
I am trying to follow DeepMind's paper on Q-learning for the game Breakout, and so far the performance is not improving, i.e. it is not learning anything at all. Instead of experience replay, I am just running the game, saving some data, training on it, and then running the game again. I've added comments to explain my implementation; any help is much appreciated. I may also be missing some key points, so please have a look.

I am sending 4 frames as input along with a one-hot matrix of the key pressed, multiplied by the reward for that key press. I am also using BreakoutDeterministic-v4, as mentioned in the paper.
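For concreteness, here is a small sketch (with made-up numbers) of what a single training pair looks like under this scheme; the shapes follow the code below:

import numpy as np

# one state: 4 stacked 84 x 84 grayscale frames
state = np.zeros((4, 84, 84))

# one target: Q-value for the chosen action, zeros elsewhere
# e.g. action 3 was taken and reward + gamma * max(Q_sa) came out to 0.95
target = np.array([0.0, 0.0, 0.0, 0.95])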

import gym
import tflearn
import numpy as np
import cv2
from collections import deque
from tflearn.layers.estimator import regression
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d


game = "BreakoutDeterministic-v4"
env = gym.make(game)
env.reset()


LR = 1e-3
num_games = 10     # arbitrary number, not final
num_frames = 500
possible_actions = env.action_space.n
accepted_score = 2
MODEL_NAME = 'data/Model_{}'
gamma = 0.9
epsilon = 0.7
generations = 30    # arbitrary number, not final
height = 84
width = 84

# instead of using experience replay, I'm simply calling this function every generation to generate training data
def play4data(gen):
    training_data = []
    for i in range(num_games):

        score = 0
        data = []
        prev_observation = []
        env.reset()
        done = False
        d = deque()

        while not done:

            # env.render()

            # if it's the 0th generation, the model hasn't been trained yet, so we can't call the predict function
            # or if I want to take a random action based on some fixed epsilon value
            # or if it's a later gen but we don't yet have 4 frames to send to the model
            if gen == 0 or len(prev_observation)==0 or np.random.rand() <= epsilon or len(d) < 4:
                theta = np.random.randn(possible_actions)
            else:
                theta = model.predict(np.array(d).reshape(-1, 4, height, width))[0]

            # action is a single value, namely max from an output like [0.00147357 0.00367402 0.00365852 0.00317618]
            action = np.argmax(theta)
            # action = env.action_space.sample()

            # take an action and record the results
            observation, reward, done, info = env.step(action)


            # since observation is 210 x 160 pixel image, resizing to 84 x 84
            observation = cv2.resize(observation, (height, width))

            # converting image to grayscale
            observation = cv2.cvtColor(observation, cv2.COLOR_RGB2GRAY)

            # d is a queue of the last 4 frames that I pass as input to the model
            d.append(observation)
            if len(d) > 4:
                d.popleft()

            # for gen 0, since the model hasn't been trained yet, Q_sa is set to zeros or random
            # or I don't yet have 4 frames to call predict with
            if gen == 0 or len(d) < 4:
                Q_sa = np.zeros(possible_actions)
            else:
                Q_sa = model.predict(np.array(d).reshape(-1, 4, height, width))[0]

            # this one is just total score after each game
            score += reward

            if not done:
                Q = reward + gamma*np.amax(Q_sa)
            else:
                Q = reward

            # instead of a mask, I just used a list comparison to multiply with the Q values
            # theta is one-hot after this, like  [0.         0.         0.         0.00293484]
            theta = (theta == np.amax(theta)) * 1 * Q


            # only appending those actions for which some reward was generated
            # otherwise the data set becomes mostly zeros and the model is 99% accurate by just predicting zeros
            if len(prev_observation) > 0 and len(d) == 4 and np.sum(theta) > 0:
                data.append([d, theta])

            prev_observation = observation

            if done:
                break

        print('gen {1} game {0}: '.format(i, gen) + str(score))

        # only taking those games for which the total score at the end of the game was above the acceptable score
        if score >= accepted_score:
            for d in data:
                training_data.append(d)

    env.reset()
    return training_data


# exact model described in the DeepMind paper, just added a layer at the end to go from 18 outputs to 4
def simple_model(width, height, num_frames, lr, output=9, model_name='intelAI.model'):
    network = input_data(shape=[None, num_frames, width, height], name='input')
    conv1 = conv_2d(network, 8, 32,strides=4, activation='relu', name='conv1')
    conv2 = conv_2d(conv1, 4, 64, strides=2, activation='relu', name='conv2')
    conv3 = conv_2d(conv2, 3, 64, strides=1, activation='relu', name='conv3')
    fc4 = fully_connected(conv3, 512, activation='relu')
    fc5 = fully_connected(fc4, 18, activation='relu')
    fc6 = fully_connected(fc5, output, activation='relu')

    network = regression(fc6, optimizer='adam',
                         loss='mean_square',
                         learning_rate=lr, name='targets')

    model = tflearn.DNN(network,
                        max_checkpoints=0, tensorboard_verbose=0, tensorboard_dir='log')
    return model


# defining/ declaring the model
model = simple_model(width, height, 4, LR, possible_actions)

# this function is responsible for training the model
def train2play(training_data):

    X = np.array([i[0] for i in training_data]).reshape(-1, 4, height, width)
    Y = [i[1] for i in training_data]


    # X is the queue of 4 frames
    model.fit({'input': X}, {'targets': Y}, n_epoch=5, snapshot_step=500, show_metric=True, run_id='openai_learning')

# repeating the whole process in terms of generations
# training again and again after playing for set number of games
for gen in range(generations):

    training_data =  play4data(gen)
    np.random.shuffle(training_data)
    train2play(training_data)

    model.save(MODEL_NAME.format(game))

1 Answer


I did not inspect every single line of code in detail, so I may have missed some things, but here are some things that may be worth looking into:

  • For how many frames (e.g. how many step() calls) are you training? I don't know by heart how much time DeepMind's DQN needed for this specific game, but many atari games really do require millions of steps before you get even just noticeable improvements in performance. It will be very difficult to tell whether it's working as intended or not from just a small amount of training.
  • Unless I missed it, it looks like you're not decaying epsilon over time. A starting value of 0.7 is fine (or I think it's more common to have even higher at the start), but it really should be lowered over time, ending at a value like 0.1 or 0.01. If you keep it that high it will start to limit how much you can learn.
  • You mentioned that you are intentionally not using Experience Replay, but Experience Replay was described in the DQN paper as being an important component for stable learning. One hypothesis for its importance is that it removes/reduces correlation between your samples of experience, which is crucial for the training of a Neural Network (if all of the samples you give to your network look alike, because they were all generated very recently from the same policy, it will not get sufficiently varied training data).
  • I don't see you using a Target Network (a separate copy of the network used to compute the Q_sa learning targets, which only occasionally gets updated by copying the parameters of the learning network into it). Like Experience Replay, this was described in the DQN paper as an important component which stabilized the learning process. I don't think that you can reasonably expect a stable learning process without it. A minimal sketch of how these last three points fit together is shown below this list.
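Below is a minimal, framework-agnostic sketch of how the last three points (epsilon decay, a replay buffer, and a target network) usually fit together. It is not the poster's exact code: q_net_predict, target_net_predict and q_net_train are hypothetical callbacks that would be backed by the actual model, and all constants are placeholder values.

import random
from collections import deque

import numpy as np

NUM_ACTIONS = 4
GAMMA = 0.99
EPSILON_START, EPSILON_END = 1.0, 0.1
EPSILON_DECAY_STEPS = 100000      # anneal epsilon linearly over this many steps
REPLAY_CAPACITY = 50000
BATCH_SIZE = 32
TARGET_UPDATE_EVERY = 10000       # how often to copy online weights into the target net

# stores (state, action, reward, next_state, done) tuples
replay_buffer = deque(maxlen=REPLAY_CAPACITY)

def epsilon_at(step):
    # linearly anneal epsilon from EPSILON_START down to EPSILON_END
    frac = min(1.0, step / float(EPSILON_DECAY_STEPS))
    return EPSILON_START + frac * (EPSILON_END - EPSILON_START)

def select_action(state, step, q_net_predict):
    # epsilon-greedy action selection using the online network
    if random.random() < epsilon_at(step):
        return random.randrange(NUM_ACTIONS)
    return int(np.argmax(q_net_predict(state)))

def train_step(q_net_predict, target_net_predict, q_net_train):
    # sample a random minibatch from the replay buffer (this breaks the
    # correlation between consecutive frames) and do one gradient update
    if len(replay_buffer) < BATCH_SIZE:
        return
    batch = random.sample(replay_buffer, BATCH_SIZE)
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))

    # start from the online net's current predictions and overwrite only the
    # entry for the action actually taken (equivalent to masking the loss)
    targets = np.array([q_net_predict(s) for s in states])

    # bootstrap targets come from the *target* network, not the online one
    next_q = np.array([target_net_predict(s) for s in next_states])
    targets[np.arange(BATCH_SIZE), actions] = (
        rewards + GAMMA * next_q.max(axis=1) * (1.0 - dones))

    q_net_train(states, targets)

After every env.step() call you would push (state, action, reward, next_state, done) into replay_buffer, call train_step(...) once the buffer holds enough samples, and copy the online weights into the target network every TARGET_UPDATE_EVERY steps.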
Dennis Soemers
  • I couldn't understand the target network. What you're saying is that I use 2 models (of the same design), and I know one has to get rewards (after reward = reward + q_sa) as output, but what does the other network get as output? – Shubham Debnath Mar 21 '18 at 20:15
  • @ShubhamDebnath Two networks, let's call them `A` and `B`, identical architectures. `A` should be used to compute `theta` in your code (predictions made in order to select actions to play). This is also the network you should train directly (`model.fit()` in your `train2play` function currently). `B`, the target network, should be used to compute the `Q_sa` values in your code. At certain intervals, **but not too often** (for example, once every `10K` steps), you should copy the parameters of `A` into `B`). – Dennis Soemers Mar 21 '18 at 20:23
  • Intuitively, `B` (the "target network") is used to compute parts of the targets that `A` is updated towards, and `B` itself updates/learns much more slowly by only sometimes getting the weights of `A` copied into `B`. You may find some more useful information in [this answer](https://stackoverflow.com/a/48963537/6735980) which I've written earlier, and the comments to that answer. – Dennis Soemers Mar 21 '18 at 20:25
  • About the number of steps: DeepMind used about a million steps, and I was not seeing any improvement even after 10,000 iterations. – Shubham Debnath Mar 21 '18 at 20:28
  • And how does one copy parameters to the other network in tflearn? – Shubham Debnath Mar 21 '18 at 20:29
  • I'm not sure about `tflearn` specifically. For `tensorflow` in general, you can find the specific line of code performing the copy in a good DQN implementation [here](https://github.com/dennybritz/reinforcement-learning/blob/master/DQN/dqn.py#L326) (where the `copy_model_parameters()` function is implemented starting at line 150). – Dennis Soemers Mar 21 '18 at 20:35
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/167292/discussion-between-shubham-debnath-and-dennis-soemers). – Shubham Debnath Mar 21 '18 at 21:01
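Following up on the parameter-copy question: the general TensorFlow 1.x pattern used in the implementation linked above (tflearn sits on top of TensorFlow, so the same idea applies) is roughly the sketch below. It assumes the two identically-structured networks were built under distinct variable scopes, here hypothetically named 'q' and 'target', and that sess is the session the models run in.

import tensorflow as tf

def copy_model_parameters(sess, source_scope, target_scope):
    # collect the trainable variables of each network by variable-scope prefix
    source_vars = sorted(
        [v for v in tf.trainable_variables() if v.name.startswith(source_scope)],
        key=lambda v: v.name)
    target_vars = sorted(
        [v for v in tf.trainable_variables() if v.name.startswith(target_scope)],
        key=lambda v: v.name)

    # assign each online-network variable into its target-network counterpart
    copy_ops = [target.assign(source)
                for source, target in zip(source_vars, target_vars)]
    sess.run(copy_ops)

# called occasionally, e.g. once every 10K steps:
# copy_model_parameters(sess, source_scope='q', target_scope='target')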