
I am trying to create a reinforcement learning agent that can buy, sell, or hold stock positions. The issue I'm having is that even after more than 2000 episodes, the agent still cannot learn when to buy, sell, or hold. Here is an image from the 2100th episode detailing what I mean: the agent will not take any action unless it is a random one. The agent learns using a replay memory, and I have double- and triple-checked that there are no errors. Here is the code for the agent:

import numpy as np
import tensorflow as tf
import random
from collections import deque
from .agent import Agent

class Agent(Agent):
    def __init__(self, state_size=7, window_size=1, action_size=3,
                 batch_size=32, gamma=.95, epsilon=.95, epsilon_decay=.95, epsilon_min=.01,
                 learning_rate=.001, is_eval=False, model_name="", stock_name="", episode=1):
    """
    state_size: Size of the state coming from the environment
    action_size: How many decisions the algo will make in the end
    gamma: Decay rate to discount future reward
    epsilon: Rate of randomly decided action
    epsilon_decay: Rate of decrease in epsilon
    epsilon_min: The lowest epsilon can get (limit to the randomness)
    learning_rate: Progress of neural net in each iteration
    episodes: How many times data will be run through
    """
    self.state_size = state_size
    self.window_size = window_size
    self.action_size = action_size
    self.batch_size = batch_size
    self.gamma = gamma
    self.epsilon = epsilon
    self.epsilon_decay = epsilon_decay
    self.epsilon_min = epsilon_min
    self.learning_rate = learning_rate
    self.is_eval = is_eval
    self.model_name = model_name
    self.stock_name = stock_name
    self.q_values = []

    self.layers = [150, 150, 150]
    tf.reset_default_graph()
    self.sess = tf.Session(config=tf.ConfigProto(allow_soft_placement = True))

    self.memory = deque()
    if self.is_eval:
        model_name = stock_name + "-" + str(episode)
        self._model_init()
        # "models/{}/{}/{}".format(stock_name, model_name, model_name + "-" + str(episode) + ".meta")
        self.saver = tf.train.Saver()
        self.saver.restore(self.sess, tf.train.latest_checkpoint("models/{}/{}".format(stock_name, model_name)))

        # self.graph = tf.get_default_graph()
        # names=[tensor.name for tensor in tf.get_default_graph().as_graph_def().node]
        # self.X_input = self.graph.get_tensor_by_name("Inputs/Inputs:0")
        # self.logits = self.graph.get_tensor_by_name("Output/Add:0")


    else:
        self._model_init()
        self.sess.run(self.init)
        self.saver = tf.train.Saver()
        path = "models/{}/6".format(self.stock_name)
        self.writer = tf.summary.FileWriter(path)
        self.writer.add_graph(self.sess.graph)

    def _model_init(self):
        """
        Init tensorflow graph vars
        """
        # (1,10,9)
        with tf.device("/device:GPU:0"):

            with tf.name_scope("Inputs"):
                self.X_input = tf.placeholder(tf.float32, [None, self.state_size], name="Inputs")
                self.Y_input = tf.placeholder(tf.float32, [None, self.action_size], name="Actions")
                self.rewards = tf.placeholder(tf.float32, [None, ], name="Rewards")

            # self.lstm_cells = [tf.contrib.rnn.GRUCell(num_units=layer)
            #                    for layer in self.layers]
            # lstm_cell = tf.contrib.rnn.LSTMCell(num_units=n_neurons, use_peepholes=True)
            # gru_cell = tf.contrib.rnn.GRUCell(num_units=n_neurons)
            # self.multi_cell = tf.contrib.rnn.MultiRNNCell(self.lstm_cells)
            # self.outputs, self.states = tf.nn.dynamic_rnn(self.multi_cell, self.X_input, dtype=tf.float32)
            # self.top_layer_h_state = self.states[-1]

            # with tf.name_scope("Output"):
            #     self.out_weights = tf.Variable(tf.truncated_normal([self.layers[-1], self.action_size]))
            #     self.out_bias = tf.Variable(tf.zeros([self.action_size]))
            #     self.logits = tf.add(tf.matmul(self.top_layer_h_state, self.out_weights), self.out_bias)

            fc1 = tf.contrib.layers.fully_connected(self.X_input, 512, activation_fn=tf.nn.relu)
            fc2 = tf.contrib.layers.fully_connected(fc1, 512, activation_fn=tf.nn.relu)
            fc3 = tf.contrib.layers.fully_connected(fc2, 512, activation_fn=tf.nn.relu)
            fc4 = tf.contrib.layers.fully_connected(fc3, 512, activation_fn=tf.nn.relu)
            self.logits = tf.contrib.layers.fully_connected(fc4, self.action_size, activation_fn=None)

            with tf.name_scope("Cross_Entropy"):
                self.loss_op = tf.losses.mean_squared_error(self.Y_input, self.logits)
                self.optimizer = tf.train.RMSPropOptimizer(learning_rate=self.learning_rate)
                self.train_op = self.optimizer.minimize(self.loss_op)

            # self.correct = tf.nn.in_top_k(self.logits, self.Y_input, 1)
            # self.accuracy = tf.reduce_mean(tf.cast(self., tf.float32))
            tf.summary.scalar("Reward", tf.reduce_mean(self.rewards))
            tf.summary.scalar("MSE", self.loss_op)
            # Merge all of the summaries
            self.summ = tf.summary.merge_all()
            self.init = tf.global_variables_initializer()

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.rand() <= self.epsilon and not self.is_eval:
            prediction = random.randrange(self.action_size)
            if prediction == 1 or prediction == 2:
                print("Random")
            return prediction

        act_values = self.sess.run(self.logits, feed_dict={self.X_input: state.reshape((1, self.state_size))})
        if np.argmax(act_values[0]) == 1 or np.argmax(act_values[0]) == 2:
            pass
        return np.argmax(act_values[0])

    def replay(self, time, episode):
        print("Replaying")
        # Take the most recent batch_size transitions from memory
        mini_batch = []
        l = len(self.memory)
        for i in range(l - self.batch_size, l):
            mini_batch.append(self.memory[i])

        mean_reward = []
        x = np.zeros((self.batch_size, self.state_size))
        y = np.zeros((self.batch_size, self.action_size))
        for i, (state, action, reward, next_state, done) in enumerate(mini_batch):
            # Q-learning target: r + gamma * max_a' Q(s', a')
            target = reward
            if not done:
                target = reward + self.gamma * np.amax(self.sess.run(self.logits, feed_dict={self.X_input: next_state.reshape((1, self.state_size))})[0])
            current_q = self.sess.run(self.logits, feed_dict={self.X_input: state.reshape((1, self.state_size))})

            current_q[0][action] = target
            x[i] = state
            y[i] = current_q.reshape((self.action_size))
            mean_reward.append(target)

        # target_f = np.array(target_f).reshape(self.batch_size - 1, self.action_size)
        # target_state = np.array(target_state).reshape(self.batch_size - 1, self.window_size, self.state_size)
        _, c, s = self.sess.run([self.train_op, self.loss_op, self.summ], feed_dict={self.X_input: x, self.Y_input: y, self.rewards: mean_reward})  # Add self.summ into the sess.run for tensorboard
        self.writer.add_summary(s, global_step=(episode + 1) / (time + 1))

        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

Once the replay memory is larger than the batch size, the replay function is run. The code might look a little messy since I have been messing with it for days now trying to figure this out. Here is a screenshot of the MSE from TensorBoard. As you can see, by the 200th episode the MSE dies down to zero or almost zero. I'm stumped! I have no idea what is going on. Please help me figure this out. The code is posted here so you can see the whole thing, including the train and eval files. The main agent I have been working on is LSTM.py in the agents folder. Thanks!

  • I am pretty sure your epsilon_decay is set way too low. After only a few iterations, your epsilon will be pretty much zero (try, for instance, 0.95^40, which is roughly 0.13). So after only 40 epochs, you are barely making any random predictions anymore. Try setting it to a higher value and see if the reported values change more significantly. – dennlinger Jul 30 '18 at 06:54
  • Sounds good, I’ll give it a go and see what happens. – Noah Meislik Jul 30 '18 at 06:57
  • I am not sure what this piece of code means: if prediction == 1 or prediction == 2: print("Random") return prediction – shunyo Aug 01 '18 at 02:53
  • It was just so I could see when predictions were random or made by the network. @dennlinger I raised the epsilon decay and have been running it for a day now, and it seems to be working well. The network has started making decisions on its own, and good ones too. Thanks! – Noah Meislik Aug 01 '18 at 02:56
  • I will add an answer with maybe a little more detail on decay, so we can have an accepted answer on this post. – dennlinger Aug 01 '18 at 05:24
  • Thank you. Appreciate it – Noah Meislik Aug 01 '18 at 11:04

2 Answers


As discussed in the comments on the question, this seems to be a problem of the decay being too aggressive. (In this code it is the exploration rate epsilon that is decayed, but the same reasoning applies to a decaying learning rate.)

Essentially, with every episode you multiply the rate by some factor j, which means that the rate after n episodes/epochs will be equal to
rate = initial_rate * j^n. In our example, the decay factor is set to 0.95, which means that after only a few dozen iterations the rate has already dropped significantly. Subsequently, the agent barely takes random actions anymore (or, in the learning-rate case, the updates only perform minute corrections) and does not "learn" anything significant anymore.
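To see how quickly a multiplicative decay of 0.95 collapses, here is a tiny standalone snippet (illustrative only, not code from the question or the repository):

initial = 1.0
decay = 0.95
for n in (10, 40, 100, 200):
    # value after n decay steps: initial * decay ** n
    print(n, round(initial * decay ** n, 5))
# prints roughly: 10 0.59874, 40 0.12851, 100 0.00592, 200 0.00004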

This leads to the question: why does decay make sense at all? Generally speaking, we want to reach a (potentially very narrow) local optimum. To do so, we try to get "relatively close" to such a minimum, and then take only smaller steps that lead us to the optimum. If we simply continued with the original learning rate, we might step over the optimal solution every time and never reach our goal. Visually, the problem can be summed up by a graphic contrasting a learning rate that is too small with one that is too large.

Another method besides a fixed decay schedule is to decrease the learning rate by a certain amount only once the algorithm stops making significant progress. This avoids the problem of the rate diminishing purely because many episodes have passed.
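A minimal sketch of that plateau-based idea (a hypothetical helper class, not something from the question's code); the same pattern works for epsilon, only shrinking the exploration rate once the reward has stopped improving:

class ReduceOnPlateau:
    """Cut a rate by `factor` whenever the monitored loss stops improving."""
    def __init__(self, initial_rate, factor=0.5, patience=50, min_rate=1e-5):
        self.rate = initial_rate
        self.factor = factor
        self.patience = patience
        self.min_rate = min_rate
        self.best = float("inf")
        self.stale = 0

    def step(self, loss):
        # Reset the counter on improvement, otherwise count stale updates
        if loss < self.best:
            self.best = loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.rate = max(self.rate * self.factor, self.min_rate)
                self.stale = 0
        return self.rate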

In your case specifically, a higher value for the decay factor (i.e. a slower decay) already seems to have helped quite a lot.

– dennlinger

The Q value in reinforcement learning does not represent the 'reward' but the 'return', which is the sum of the current reward and the discounted future rewards. When your model enters this 'dead end' of all-zero actions, the rewards will be zero under your setup. After a while, your replay memory will be full of memories saying 'an action of zero gives a reward of zero', so no matter how you update your model, it cannot get out of this dead end.
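To make that concrete, here is the target the replay step computes and what it collapses to once every stored transition is a zero-reward 'hold' (plain NumPy with illustrative values, not the questioner's actual numbers):

import numpy as np

gamma = 0.95
reward = 0.0                           # holding earned nothing
next_q = np.array([0.0, 0.0, 0.0])     # the network already predicts ~0 for every action
target = reward + gamma * np.max(next_q)   # Q target = return estimate, not just the reward
print(target)  # 0.0 -> the update pushes Q(s, hold) towards 0 again, so nothing changes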

As @dennlinger said, you could increase your epsilon so that your model gets some fresh memories to learn from; you could also use prioritized experience replay to train on the 'useful' experiences.
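A bare-bones sketch of proportional prioritized experience replay (an illustration of the idea, not a drop-in replacement for the deque used in the question): transitions with a larger TD error are sampled more often.

import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity=10000, alpha=0.6, eps=1e-3):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        # Drop the oldest transition once the buffer is full
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        # Sample indices with probability proportional to priority
        p = np.array(self.priorities)
        p = p / p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        return [self.data[i] for i in idx], idx

    def update(self, idx, td_errors):
        # Refresh priorities with the TD errors from the latest update
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha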

However, I suggest you look at the environment itself first. Your model outputs zeros because it sees no better choice; is that actually true? You are trading stocks, so are you sure there is enough information in the state to learn a strategy that yields a reward larger than zero? You need to think this through before doing any tuning. For example, if the stock moves up or down with a pure 50/50 random chance, you will never find a strategy whose average reward is larger than zero.
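As a quick sanity check of that point (an illustrative simulation, not part of the original answer): under a symmetric 50/50 random walk, even a policy that captures every single move averages out to zero profit.

import numpy as np

np.random.seed(0)
# 10,000 episodes of 250 price moves, each +1 or -1 with equal probability
moves = np.random.choice([-1.0, 1.0], size=(10000, 250))
# An always-long policy earns every move; its average profit per episode:
print(moves.sum(axis=1).mean())   # close to 0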

The reinforcement learning agent may already have found the best policy, even though it is not the one you want.

– Kevin Fang