Model for OpenAI gym's Lunar Lander not converging

Question

I am trying to use deep reinforcement learning with keras to train an agent to learn how to play the Lunar Lander OpenAI gym environment. The problem is that my model is not converging. Here is my code:

import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers

def get_random_action(epsilon):
    return np.random.rand(1) < epsilon

def get_reward_prediction(q, a):
    qs_a = np.concatenate((q, table[a]), axis=0)
    x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
    x[0] = qs_a
    guess = model.predict(x[0].reshape(1, x.shape[1]))
    r = guess[0][0]
    return r

results = []
epsilon = 0.05
alpha = 0.003
gamma = 0.3
environment_parameters = 8
num_of_possible_actions = 4
obs = 15
mem_max = 100000
epochs = 3
total_episodes = 15000

possible_actions = np.arange(0, num_of_possible_actions)
table = np.zeros((num_of_possible_actions, num_of_possible_actions))
table[np.arange(num_of_possible_actions), possible_actions] = 1

env = gym.make('LunarLander-v2')
env.reset()

i_x = np.random.random((5, environment_parameters + num_of_possible_actions))
i_y = np.random.random((5, 1))

model = Sequential()
model.add(Dense(512, activation='relu', input_dim=i_x.shape[1]))
model.add(Dense(i_y.shape[1]))

opt = optimizers.adam(lr=alpha)

model.compile(loss='mse', optimizer=opt, metrics=['accuracy'])

total_steps = 0
i_x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
i_y = np.zeros(shape=(1, 1))

mem_x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
mem_y = np.zeros(shape=(1, 1))
max_steps = 40000

for episode in range(total_episodes):
    g_x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
    g_y = np.zeros(shape=(1, 1))
    q_t = env.reset()
    episode_reward = 0

    for step_number in range(max_steps):
        if episode < obs:
            a = env.action_space.sample()
        else:
            if get_random_action(epsilon, total_episodes, episode):
                a = env.action_space.sample()
            else:
                actions = np.zeros(shape=num_of_possible_actions)

                for i in range(4):
                    actions[i] = get_reward_prediction(q_t, i)

                a = np.argmax(actions)

        # env.render()
        qa = np.concatenate((q_t, table[a]), axis=0)

        s, r, episode_complete, data = env.step(a)
        episode_reward += r

        if step_number is 0:
            g_x[0] = qa
            g_y[0] = np.array([r])
            mem_x[0] = qa
            mem_y[0] = np.array([r])

        g_x = np.vstack((g_x, qa))
        g_y = np.vstack((g_y, np.array([r])))

        if episode_complete:
            for i in range(0, g_y.shape[0]):
                if i is 0:
                    g_y[(g_y.shape[0] - 1) - i][0] = g_y[(g_y.shape[0] - 1) - i][0]
                else:
                    g_y[(g_y.shape[0] - 1) - i][0] = g_y[(g_y.shape[0] - 1) - i][0] + gamma * g_y[(g_y.shape[0] - 1) - i + 1][0]

            if mem_x.shape[0] is 1:
                mem_x = g_x
                mem_y = g_y
            else:
                mem_x = np.concatenate((mem_x, g_x), axis=0)
                mem_y = np.concatenate((mem_y, g_y), axis=0)

            if np.alen(mem_x) >= mem_max:
                for l in range(np.alen(g_x)):
                    mem_x = np.delete(mem_x, 0, axis=0)
                    mem_y = np.delete(mem_y, 0, axis=0)

        q_t = s

        if episode_complete and episode >= obs:
            if episode%10 == 0:
                model.fit(mem_x, mem_y, batch_size=32, epochs=epochs, verbose=0)

        if episode_complete:
            results.append(episode_reward)
            break

I am running tens of thousands of episodes and my model still won't converge. It will begin to reduce average change in policy over ~5000 episodes while increasing the average reward, but then it goes off the deep end and the average reward per episode actually goes down after that. I've tried messing with the hyperparameters, but I haven't gotten anywhere with that. I'm trying to model my code after the DeepMind DQN paper.

score 3 · Accepted Answer · answered Jul 19 '18 at 15:19

You might want to change your get_random_action function to decay epsilon with each episode. After all, assuming your agent can learn an optimal policy, at some point you won't want to take random actions at all, right? Here's a slightly different version of get_random_action that would do this for you:

def get_random_action(epsilon, total_episodes, episode):
        explore_prob = epsilon - (epsilon * (episode / total_episodes))
        return np.random.rand(1) < explore_prob

In this modified version of your function, epsilon will decrease slightly with each episode. This may help your model converge.

There are a handful of ways to decay a parameter. For more info, check out this Wikipedia article.

Hmmm... I will try this and get back to you – Gregory T. Cerchione Jul 19 '18 at 15:24 — Gregory T. Cerchione, Jul 19 '18 at 15:24

score 1 · Answer 2 · answered Jul 26 '18 at 13:57

I recently implemented this successfully. https://github.com/tianchuliang/techblog/tree/master/OpenAIGym

Basically, I let the agent run randomly for 3000 frames while collecting these as initial training data (states) and labels (rewards), then after that I train my neural net model every 100 frames and let the model make decisions as to what action results in best score.

See my github, it may help. Oh, my training iterations are on YouTube too, https://www.youtube.com/watch?v=wrrr90Pevuw https://www.youtube.com/watch?v=TJzKbFAlKa0 https://www.youtube.com/watch?v=y91uA_cDGGs

Model for OpenAI gym's Lunar Lander not converging

2 Answers2