I am applying reinforcement learning to a time series prediction problem. So far, I have implemented a dueling DDQN algorithm with an LSTM, which gives fairly good results, though it is sometimes slow to converge depending on the exact problem. I have since implemented C51 distributional reinforcement learning to compare performance (I expect it to give better results).
I slightly adapted the C51 code from Google's Dopamine to integrate it into my own code (the network and training parts). I also use double Q-learning to select the next state's action (which the original code does not). The problem is that it is extremely slow to execute. For comparison, my previous dueling DDQN took about 3.5 hours to train for 50,000 episodes, whereas the C51 algorithm has now been running for almost 10 hours and has only reached 3,000 episodes.
I am wondering whether there is something wrong with my adaptation of the code, or whether the C51 algorithm is really that slow. I am running on an NVIDIA GeForce RTX 2080 Ti.
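For reference, one way to measure a single training step, and the size of the TensorFlow graph along the way, is a minimal sketch like the following (the commented line is a placeholder for one training iteration, not my actual loop):

import time
import tensorflow as tf

t0 = time.time()
# ... one sess.run training step goes here ...
step_time = time.time() - t0
# If this count keeps growing across iterations, new ops are being
# added to the default graph inside the training loop.
num_ops = len(tf.get_default_graph().get_operations())
print('step time: %.3fs, ops in graph: %d' % (step_time, num_ops))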
Here is the network part:
# network part
self.weights_initializer = tf.contrib.slim.variance_scaling_initializer(
    factor=1.0 / np.sqrt(3.0), mode='FAN_IN', uniform=True)
# One logit per (action, atom) pair on top of the LSTM output.
self.net = tf.contrib.slim.fully_connected(
    self.rnn,  # output of an LSTM
    num_actions * num_atoms,
    activation_fn=None,
    weights_initializer=self.weights_initializer)
self.logits = tf.reshape(self.net, [-1, num_actions, num_atoms])
self.probabilities = tf.contrib.layers.softmax(self.logits)
# Q-values are the expectations of the per-action atom distributions.
self.q_values = tf.reduce_sum(self._support * self.probabilities, axis=2)
self.predict = tf.argmax(self.q_values, 1)

self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
self.target_distribution = tf.placeholder(shape=[None, num_atoms],
                                          dtype=tf.float32)
# size of indices: batch_size x 1.
self.indices = tf.range(tf.shape(self.logits)[0])[:, None]
# size of reshaped_actions: batch_size x 2.
self.reshaped_actions = tf.concat([self.indices, self.actions[:, None]], 1)
# For each element of the batch, fetch the logits for its selected action.
self.chosen_action_logits = tf.gather_nd(self.logits, self.reshaped_actions)
self.td_error = tf.nn.softmax_cross_entropy_with_logits(
    labels=self.target_distribution, logits=self.chosen_action_logits)
# Divide by the real length of the episodes; a plain average would be
# incorrect here.
self.loss = tf.cast(tf.reduce_sum(self.td_error), tf.float64) \
            / tf.cast(tf.reduce_sum(self.seq_len), tf.float64)

if apply_grad_clipping:
    # Calculate gradients and clip them to handle outliers.
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(self.loss, tvars),
                                      grad_clipping)
    self.updateModel = optimizer.apply_gradients(zip(grads, tvars),
                                                 name="updateModel")
else:
    self.updateModel = optimizer.minimize(self.loss, name="updateModel")
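To illustrate what the q_values / predict lines compute, here is a toy NumPy version with made-up numbers (each action's Q-value is the expectation of its atom distribution):

import numpy as np

support = np.array([-1.0, 0.0, 1.0])            # num_atoms = 3
probabilities = np.array([[0.2, 0.5, 0.3],      # distribution for action 0
                          [0.6, 0.3, 0.1]])     # distribution for action 1
q_values = np.sum(support * probabilities, axis=1)  # -> [ 0.1, -0.5]
predict = np.argmax(q_values)                       # greedy action: 0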
And here's the training part:
# training part
if i >= pre_train_episodes:
    # Reset the LSTM's hidden state.
    state_train = np.zeros((num_layers, 2, batch_size, h_size))
    # Get a random batch of experiences.
    trainBatch = myBuffer.sample(batch_size)
    # Below we perform the Double-DQN update to the target Q-values.
    num_samples = batch_size * trace_length
    # size of rewards: num_samples x 1
    rewards = trainBatch[:, 2][:, None]
    # size of tiled_support: num_samples x num_atoms
    tiled_support = tf.tile(mainQN._support, [num_samples])
    tiled_support = tf.reshape(tiled_support, [num_samples, num_atoms])
    # Incorporate terminal states into the discount factor.
    # size of gamma_with_terminal: num_samples x 1
    is_terminal_multiplier = -(np.array(trainBatch[:, 4]) - 1)
    gamma_with_terminal = gamma * is_terminal_multiplier
    gamma_with_terminal = gamma_with_terminal[:, None]
    # size of target_support: num_samples x num_atoms
    target_support = rewards + gamma_with_terminal * tiled_support
    # Double Q-learning: the main network selects the next action...
    next_qt_argmax = sess.run([mainQN.predict],
        feed_dict={mainQN.scalarInput: np.vstack(trainBatch[:, 3]),
                   mainQN.trainLength: trace_length,
                   mainQN.state_in: state_train,
                   mainQN.batch_size: batch_size})
    next_qt_argmax = np.reshape(next_qt_argmax, [-1, 1])
    # ...and the target network evaluates its distribution.
    probabilities = sess.run(targetQN.probabilities,
        feed_dict={targetQN.scalarInput: np.vstack(trainBatch[:, 3]),
                   targetQN.trainLength: trace_length,
                   targetQN.state_in: state_train,
                   targetQN.batch_size: batch_size})
    batch_indices = np.arange(num_samples)[:, None]
    batch_indexed_next_qt_argmax = np.concatenate(
        [batch_indices, next_qt_argmax], axis=1)
    # size of next_probabilities: num_samples x num_atoms
    next_probabilities = tf.gather_nd(probabilities,
                                      batch_indexed_next_qt_argmax)
    # Project the Bellman target onto the fixed support and evaluate it
    # to a numpy array to feed into the loss.
    target_distribution = project_distribution(target_support,
                                               next_probabilities,
                                               mainQN._support)
    target_distribution = target_distribution.eval()
    loss, _, _ = sess.run([mainQN.loss, mainQN.check_ops, mainQN.updateModel],
        feed_dict={mainQN.scalarInput: np.vstack(trainBatch[:, 0]),
                   mainQN.target_distribution: target_distribution,
                   mainQN.actions: trainBatch[:, 1],
                   mainQN.trainLength: trace_length,
                   mainQN.state_in: state_train,
                   mainQN.batch_size: batch_size})
    # Perform a soft/hard target-network update at the chosen frequency.
    if i % update_target_freq == 0 or update_target_freq == 1 or softUpdate:
        updateTarget(targetOps, sess)
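To make the target construction concrete, here is a toy NumPy version (made-up numbers) of the rewards + gamma_with_terminal * tiled_support step, with one terminal and one non-terminal transition:

import numpy as np

support = np.array([-1.0, 0.0, 1.0])       # the fixed support z
rewards = np.array([[0.5], [0.2]])
terminal = np.array([1.0, 0.0])            # trainBatch[:, 4]-style flags
gamma_with_terminal = 0.99 * -(terminal - 1)[:, None]
target_support = rewards + gamma_with_terminal * support
# The terminal row collapses onto its reward:
# [[ 0.5    0.5    0.5 ]
#  [-0.79   0.2    1.19]]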
And here is the auxiliary function:
# function used above to project the distribution onto the provided support
def project_distribution(supports, weights, target_support,
                         validate_args=False):
    """Projects a batch of (support, weights) onto target_support.

    Based on equation (7) in (Bellemare et al., 2017):
      https://arxiv.org/abs/1707.06887
    In the rest of the comments we will refer to this equation simply as Eq7.

    This code is not easy to digest, so we will use a running example to
    clarify what is going on, with the following sample inputs:

      * supports =       [[0, 2, 4, 6, 8],
                          [1, 3, 4, 5, 6]]
      * weights =        [[0.1, 0.6, 0.1, 0.1, 0.1],
                          [0.1, 0.2, 0.5, 0.1, 0.1]]
      * target_support = [4, 5, 6, 7, 8]

    In the code below, comments preceded with 'Ex:' will be referencing the
    above values.

    Args:
      supports: Tensor of shape (batch_size, num_dims) defining supports for
        the distribution.
      weights: Tensor of shape (batch_size, num_dims) defining weights on the
        original support points. Although for the CategoricalDQN agent these
        weights are probabilities, it is not required that they are.
      target_support: Tensor of shape (num_dims) defining support of the
        projected distribution. The values must be monotonically increasing.
        Vmin and Vmax will be inferred from the first and last elements of
        this tensor, respectively. The values in this tensor must be equally
        spaced.
      validate_args: Whether we will verify the contents of the
        target_support parameter.

    Returns:
      A Tensor of shape (batch_size, num_dims) with the projection of a batch
      of (support, weights) onto target_support.

    Raises:
      ValueError: If target_support has no dimensions, or if shapes of
        supports, weights, and target_support are incompatible.
    """
    target_support_deltas = target_support[1:] - target_support[:-1]
    # delta_z = `\Delta z` in Eq7.
    delta_z = target_support_deltas[0]
    validate_deps = []
    supports.shape.assert_is_compatible_with(weights.shape)
    supports[0].shape.assert_is_compatible_with(target_support.shape)
    target_support.shape.assert_has_rank(1)
    if validate_args:
        # Assert that supports and weights have the same shapes.
        validate_deps.append(
            tf.Assert(
                tf.reduce_all(tf.equal(tf.shape(supports), tf.shape(weights))),
                [supports, weights]))
        # Assert that elements of supports and target_support have the same
        # shape.
        validate_deps.append(
            tf.Assert(
                tf.reduce_all(
                    tf.equal(tf.shape(supports)[1], tf.shape(target_support))),
                [supports, target_support]))
        # Assert that target_support has a single dimension.
        validate_deps.append(
            tf.Assert(
                tf.equal(tf.size(tf.shape(target_support)), 1),
                [target_support]))
        # Assert that the target_support is monotonically increasing.
        validate_deps.append(
            tf.Assert(tf.reduce_all(target_support_deltas > 0),
                      [target_support]))
        # Assert that the values in target_support are equally spaced.
        validate_deps.append(
            tf.Assert(
                tf.reduce_all(tf.equal(target_support_deltas, delta_z)),
                [target_support]))

    with tf.control_dependencies(validate_deps):
        # Ex: `v_min, v_max = 4, 8`.
        v_min, v_max = target_support[0], target_support[-1]
        # Ex: `batch_size = 2`.
        batch_size = tf.shape(supports)[0]
        # `N` in Eq7.
        # Ex: `num_dims = 5`.
        num_dims = tf.shape(target_support)[0]
        # clipped_support = `[\hat{T}_{z_j}]^{V_max}_{V_min}` in Eq7.
        # Ex: `clipped_support = [[[ 4.  4.  4.  6.  8.]]
        #                         [[ 4.  4.  4.  5.  6.]]]`.
        clipped_support = tf.clip_by_value(supports, v_min, v_max)[:, None, :]
        # Ex: `tiled_support = [[[[ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]]
        #                        [[ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]]]]`.
        tiled_support = tf.tile([clipped_support], [1, 1, num_dims, 1])
        # Ex: `reshaped_target_support = [[[ 4.]
        #                                  [ 5.]
        #                                  [ 6.]
        #                                  [ 7.]
        #                                  [ 8.]]
        #                                 [[ 4.]
        #                                  [ 5.]
        #                                  [ 6.]
        #                                  [ 7.]
        #                                  [ 8.]]]`.
        reshaped_target_support = tf.tile(target_support[:, None],
                                          [batch_size, 1])
        reshaped_target_support = tf.reshape(reshaped_target_support,
                                             [batch_size, num_dims, 1])
        # numerator = `|clipped_support - z_i|` in Eq7.
        # Ex: `numerator = [[[[ 0.  0.  0.  2.  4.]
        #                     [ 1.  1.  1.  1.  3.]
        #                     [ 2.  2.  2.  0.  2.]
        #                     [ 3.  3.  3.  1.  1.]
        #                     [ 4.  4.  4.  2.  0.]]
        #                    [[ 0.  0.  0.  1.  2.]
        #                     [ 1.  1.  1.  0.  1.]
        #                     [ 2.  2.  2.  1.  0.]
        #                     [ 3.  3.  3.  2.  1.]
        #                     [ 4.  4.  4.  3.  2.]]]]`.
        numerator = tf.abs(tiled_support - reshaped_target_support)
        quotient = 1 - (numerator / delta_z)
        # clipped_quotient = `[1 - numerator / (\Delta z)]_0^1` in Eq7.
        # Ex: `clipped_quotient = [[[[ 1.  1.  1.  0.  0.]
        #                            [ 0.  0.  0.  0.  0.]
        #                            [ 0.  0.  0.  1.  0.]
        #                            [ 0.  0.  0.  0.  0.]
        #                            [ 0.  0.  0.  0.  1.]]
        #                           [[ 1.  1.  1.  0.  0.]
        #                            [ 0.  0.  0.  1.  0.]
        #                            [ 0.  0.  0.  0.  1.]
        #                            [ 0.  0.  0.  0.  0.]
        #                            [ 0.  0.  0.  0.  0.]]]]`.
        clipped_quotient = tf.clip_by_value(quotient, 0, 1)
        # Ex: `weights = [[ 0.1  0.6  0.1  0.1  0.1]
        #                 [ 0.1  0.2  0.5  0.1  0.1]]`.
        weights = weights[:, None, :]
        # inner_prod = `\sum_{j=0}^{N-1} clipped_quotient * p_j(x', \pi(x'))`
        # in Eq7.
        # Ex: `inner_prod = [[[[ 0.1  0.6  0.1  0.   0. ]
        #                      [ 0.   0.   0.   0.   0. ]
        #                      [ 0.   0.   0.   0.1  0. ]
        #                      [ 0.   0.   0.   0.   0. ]
        #                      [ 0.   0.   0.   0.   0.1]]
        #                     [[ 0.1  0.2  0.5  0.   0. ]
        #                      [ 0.   0.   0.   0.1  0. ]
        #                      [ 0.   0.   0.   0.   0.1]
        #                      [ 0.   0.   0.   0.   0. ]
        #                      [ 0.   0.   0.   0.   0. ]]]]`.
        inner_prod = clipped_quotient * weights
        # Ex: `projection = [[ 0.8  0.0  0.1  0.0  0.1]
        #                    [ 0.8  0.1  0.1  0.0  0.0]]`.
        projection = tf.reduce_sum(inner_prod, 3)
        projection = tf.reshape(projection, [batch_size, num_dims])
        return projection
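As a sanity check, feeding the docstring's running example through the function reproduces the projection given there:

import tensorflow as tf

supports = tf.constant([[0., 2., 4., 6., 8.],
                        [1., 3., 4., 5., 6.]])
weights = tf.constant([[0.1, 0.6, 0.1, 0.1, 0.1],
                       [0.1, 0.2, 0.5, 0.1, 0.1]])
target_support = tf.constant([4., 5., 6., 7., 8.])
projection = project_distribution(supports, weights, target_support)
with tf.Session() as sess:
    print(sess.run(projection))
# [[0.8 0.  0.1 0.  0.1]
#  [0.8 0.1 0.1 0.  0. ]]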
Thank you in advance!