I am applying reinforcement learning to a time series prediction problem. So far, I have implemented a dueling DDQN algorithm with an LSTM, which gives fairly good results, though it is sometimes slow to converge depending on the exact problem. I have since implemented C51 distributional reinforcement learning to compare performance (I expect it to give better results).
I slightly adapted the C51 code from Google's Dopamine to integrate it into my own code (the network and training parts). I also use double Q-learning to select the next state's action (which the original code does not). The problem is that it is extremely slow to execute. For comparison, my previous dueling DDQN took about 3.5 hours to train for 50,000 episodes, whereas the C51 algorithm has now been running for almost 10 hours and has only reached 3,000 episodes.
I am wondering whether there is something wrong with my adaptation of the code, or whether the C51 algorithm is really that slow. I am running on an NVIDIA GeForce RTX 2080 Ti.
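For reference, one way to measure a single training step, and the size of the TensorFlow graph along the way, is a minimal sketch like the following (the commented line is a placeholder for one training iteration, not my actual loop):

import time
import tensorflow as tf

t0 = time.time()
# ... one sess.run training step goes here ...
step_time = time.time() - t0
# If this count keeps growing across iterations, new ops are being
# added to the default graph inside the training loop.
num_ops = len(tf.get_default_graph().get_operations())
print('step time: %.3fs, ops in graph: %d' % (step_time, num_ops))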
Here is the network part:
# network part
self.weights_initializer = tf.contrib.slim.variance_scaling_initializer(
    factor=1.0 / np.sqrt(3.0), mode='FAN_IN', uniform=True)
# One logit per (action, atom) pair on top of the LSTM output.
self.net = tf.contrib.slim.fully_connected(
    self.rnn,  # output of an LSTM
    num_actions * num_atoms,
    activation_fn=None,
    weights_initializer=self.weights_initializer)
self.logits = tf.reshape(self.net, [-1, num_actions, num_atoms])
self.probabilities = tf.contrib.layers.softmax(self.logits)
# Q-values are the expectations of the per-action atom distributions.
self.q_values = tf.reduce_sum(self._support * self.probabilities, axis=2)
self.predict = tf.argmax(self.q_values, 1)

self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
self.target_distribution = tf.placeholder(shape=[None, num_atoms],
                                          dtype=tf.float32)
# size of indices: batch_size x 1.
self.indices = tf.range(tf.shape(self.logits)[0])[:, None]
# size of reshaped_actions: batch_size x 2.
self.reshaped_actions = tf.concat([self.indices, self.actions[:, None]], 1)
# For each element of the batch, fetch the logits for its selected action.
self.chosen_action_logits = tf.gather_nd(self.logits, self.reshaped_actions)
self.td_error = tf.nn.softmax_cross_entropy_with_logits(
    labels=self.target_distribution, logits=self.chosen_action_logits)
# Divide by the real length of the episodes; a plain average would be
# incorrect here.
self.loss = tf.cast(tf.reduce_sum(self.td_error), tf.float64) \
            / tf.cast(tf.reduce_sum(self.seq_len), tf.float64)

if apply_grad_clipping:
    # Calculate gradients and clip them to handle outliers.
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(self.loss, tvars),
                                      grad_clipping)
    self.updateModel = optimizer.apply_gradients(zip(grads, tvars),
                                                 name="updateModel")
else:
    self.updateModel = optimizer.minimize(self.loss, name="updateModel")
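To illustrate what the q_values / predict lines compute, here is a toy NumPy version with made-up numbers (each action's Q-value is the expectation of its atom distribution):

import numpy as np

support = np.array([-1.0, 0.0, 1.0])            # num_atoms = 3
probabilities = np.array([[0.2, 0.5, 0.3],      # distribution for action 0
                          [0.6, 0.3, 0.1]])     # distribution for action 1
q_values = np.sum(support * probabilities, axis=1)  # -> [ 0.1, -0.5]
predict = np.argmax(q_values)                       # greedy action: 0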
And here's the training part:
# training part
if i >= pre_train_episodes:
    # Reset the LSTM's hidden state.
    state_train = np.zeros((num_layers, 2, batch_size, h_size))
    # Get a random batch of experiences.
    trainBatch = myBuffer.sample(batch_size)
    # Below we perform the Double-DQN update to the target Q-values.
    num_samples = batch_size * trace_length
    # size of rewards: num_samples x 1
    rewards = trainBatch[:, 2][:, None]
    # size of tiled_support: num_samples x num_atoms
    tiled_support = tf.tile(mainQN._support, [num_samples])
    tiled_support = tf.reshape(tiled_support, [num_samples, num_atoms])
    # Incorporate terminal states into the discount factor.
    # size of gamma_with_terminal: num_samples x 1
    is_terminal_multiplier = -(np.array(trainBatch[:, 4]) - 1)
    gamma_with_terminal = gamma * is_terminal_multiplier
    gamma_with_terminal = gamma_with_terminal[:, None]
    # size of target_support: num_samples x num_atoms
    target_support = rewards + gamma_with_terminal * tiled_support
    # Double Q-learning: the main network selects the next action...
    next_qt_argmax = sess.run([mainQN.predict],
        feed_dict={mainQN.scalarInput: np.vstack(trainBatch[:, 3]),
                   mainQN.trainLength: trace_length,
                   mainQN.state_in: state_train,
                   mainQN.batch_size: batch_size})
    next_qt_argmax = np.reshape(next_qt_argmax, [-1, 1])
    # ...and the target network evaluates its distribution.
    probabilities = sess.run(targetQN.probabilities,
        feed_dict={targetQN.scalarInput: np.vstack(trainBatch[:, 3]),
                   targetQN.trainLength: trace_length,
                   targetQN.state_in: state_train,
                   targetQN.batch_size: batch_size})
    batch_indices = np.arange(num_samples)[:, None]
    batch_indexed_next_qt_argmax = np.concatenate(
        [batch_indices, next_qt_argmax], axis=1)
    # size of next_probabilities: num_samples x num_atoms
    next_probabilities = tf.gather_nd(probabilities,
                                      batch_indexed_next_qt_argmax)
    # Project the Bellman target onto the fixed support and evaluate it
    # to a numpy array to feed into the loss.
    target_distribution = project_distribution(target_support,
                                               next_probabilities,
                                               mainQN._support)
    target_distribution = target_distribution.eval()
    loss, _, _ = sess.run([mainQN.loss, mainQN.check_ops, mainQN.updateModel],
        feed_dict={mainQN.scalarInput: np.vstack(trainBatch[:, 0]),
                   mainQN.target_distribution: target_distribution,
                   mainQN.actions: trainBatch[:, 1],
                   mainQN.trainLength: trace_length,
                   mainQN.state_in: state_train,
                   mainQN.batch_size: batch_size})
    # Perform a soft/hard target-network update at the chosen frequency.
    if i % update_target_freq == 0 or update_target_freq == 1 or softUpdate:
        updateTarget(targetOps, sess)
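To make the target construction concrete, here is a toy NumPy version (made-up numbers) of the rewards + gamma_with_terminal * tiled_support step, with one terminal and one non-terminal transition:

import numpy as np

support = np.array([-1.0, 0.0, 1.0])       # the fixed support z
rewards = np.array([[0.5], [0.2]])
terminal = np.array([1.0, 0.0])            # trainBatch[:, 4]-style flags
gamma_with_terminal = 0.99 * -(terminal - 1)[:, None]
target_support = rewards + gamma_with_terminal * support
# The terminal row collapses onto its reward:
# [[ 0.5    0.5    0.5 ]
#  [-0.79   0.2    1.19]]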
And here is the auxiliary function:
# function used above to project the distribution onto the provided support
def project_distribution(supports, weights, target_support,
                         validate_args=False):
    """Projects a batch of (support, weights) onto target_support.

    Based on equation (7) in (Bellemare et al., 2017):
      https://arxiv.org/abs/1707.06887
    In the rest of the comments we will refer to this equation simply as Eq7.

    This code is not easy to digest, so we will use a running example to
    clarify what is going on, with the following sample inputs:

      * supports =       [[0, 2, 4, 6, 8],
                          [1, 3, 4, 5, 6]]
      * weights =        [[0.1, 0.6, 0.1, 0.1, 0.1],
                          [0.1, 0.2, 0.5, 0.1, 0.1]]
      * target_support = [4, 5, 6, 7, 8]

    In the code below, comments preceded with 'Ex:' will be referencing the
    above values.

    Args:
      supports: Tensor of shape (batch_size, num_dims) defining supports for
        the distribution.
      weights: Tensor of shape (batch_size, num_dims) defining weights on the
        original support points. Although for the CategoricalDQN agent these
        weights are probabilities, it is not required that they are.
      target_support: Tensor of shape (num_dims) defining support of the
        projected distribution. The values must be monotonically increasing.
        Vmin and Vmax will be inferred from the first and last elements of
        this tensor, respectively. The values in this tensor must be equally
        spaced.
      validate_args: Whether we will verify the contents of the
        target_support parameter.

    Returns:
      A Tensor of shape (batch_size, num_dims) with the projection of a batch
      of (support, weights) onto target_support.

    Raises:
      ValueError: If target_support has no dimensions, or if shapes of
        supports, weights, and target_support are incompatible.
    """
    target_support_deltas = target_support[1:] - target_support[:-1]
    # delta_z = `\Delta z` in Eq7.
    delta_z = target_support_deltas[0]
    validate_deps = []
    supports.shape.assert_is_compatible_with(weights.shape)
    supports[0].shape.assert_is_compatible_with(target_support.shape)
    target_support.shape.assert_has_rank(1)
    if validate_args:
        # Assert that supports and weights have the same shapes.
        validate_deps.append(
            tf.Assert(
                tf.reduce_all(tf.equal(tf.shape(supports), tf.shape(weights))),
                [supports, weights]))
        # Assert that elements of supports and target_support have the same
        # shape.
        validate_deps.append(
            tf.Assert(
                tf.reduce_all(
                    tf.equal(tf.shape(supports)[1], tf.shape(target_support))),
                [supports, target_support]))
        # Assert that target_support has a single dimension.
        validate_deps.append(
            tf.Assert(
                tf.equal(tf.size(tf.shape(target_support)), 1),
                [target_support]))
        # Assert that the target_support is monotonically increasing.
        validate_deps.append(
            tf.Assert(tf.reduce_all(target_support_deltas > 0),
                      [target_support]))
        # Assert that the values in target_support are equally spaced.
        validate_deps.append(
            tf.Assert(
                tf.reduce_all(tf.equal(target_support_deltas, delta_z)),
                [target_support]))

    with tf.control_dependencies(validate_deps):
        # Ex: `v_min, v_max = 4, 8`.
        v_min, v_max = target_support[0], target_support[-1]
        # Ex: `batch_size = 2`.
        batch_size = tf.shape(supports)[0]
        # `N` in Eq7.
        # Ex: `num_dims = 5`.
        num_dims = tf.shape(target_support)[0]
        # clipped_support = `[\hat{T}_{z_j}]^{V_max}_{V_min}` in Eq7.
        # Ex: `clipped_support = [[[ 4.  4.  4.  6.  8.]]
        #                         [[ 4.  4.  4.  5.  6.]]]`.
        clipped_support = tf.clip_by_value(supports, v_min, v_max)[:, None, :]
        # Ex: `tiled_support = [[[[ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]
        #                         [ 4.  4.  4.  6.  8.]]
        #                        [[ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]
        #                         [ 4.  4.  4.  5.  6.]]]]`.
        tiled_support = tf.tile([clipped_support], [1, 1, num_dims, 1])
        # Ex: `reshaped_target_support = [[[ 4.]
        #                                  [ 5.]
        #                                  [ 6.]
        #                                  [ 7.]
        #                                  [ 8.]]
        #                                 [[ 4.]
        #                                  [ 5.]
        #                                  [ 6.]
        #                                  [ 7.]
        #                                  [ 8.]]]`.
        reshaped_target_support = tf.tile(target_support[:, None],
                                          [batch_size, 1])
        reshaped_target_support = tf.reshape(reshaped_target_support,
                                             [batch_size, num_dims, 1])
        # numerator = `|clipped_support - z_i|` in Eq7.
        # Ex: `numerator = [[[[ 0.  0.  0.  2.  4.]
        #                     [ 1.  1.  1.  1.  3.]
        #                     [ 2.  2.  2.  0.  2.]
        #                     [ 3.  3.  3.  1.  1.]
        #                     [ 4.  4.  4.  2.  0.]]
        #                    [[ 0.  0.  0.  1.  2.]
        #                     [ 1.  1.  1.  0.  1.]
        #                     [ 2.  2.  2.  1.  0.]
        #                     [ 3.  3.  3.  2.  1.]
        #                     [ 4.  4.  4.  3.  2.]]]]`.
        numerator = tf.abs(tiled_support - reshaped_target_support)
        quotient = 1 - (numerator / delta_z)
        # clipped_quotient = `[1 - numerator / (\Delta z)]_0^1` in Eq7.
        # Ex: `clipped_quotient = [[[[ 1.  1.  1.  0.  0.]
        #                            [ 0.  0.  0.  0.  0.]
        #                            [ 0.  0.  0.  1.  0.]
        #                            [ 0.  0.  0.  0.  0.]
        #                            [ 0.  0.  0.  0.  1.]]
        #                           [[ 1.  1.  1.  0.  0.]
        #                            [ 0.  0.  0.  1.  0.]
        #                            [ 0.  0.  0.  0.  1.]
        #                            [ 0.  0.  0.  0.  0.]
        #                            [ 0.  0.  0.  0.  0.]]]]`.
        clipped_quotient = tf.clip_by_value(quotient, 0, 1)
        # Ex: `weights = [[ 0.1  0.6  0.1  0.1  0.1]
        #                 [ 0.1  0.2  0.5  0.1  0.1]]`.
        weights = weights[:, None, :]
        # inner_prod = `\sum_{j=0}^{N-1} clipped_quotient * p_j(x', \pi(x'))`
        # in Eq7.
        # Ex: `inner_prod = [[[[ 0.1  0.6  0.1  0.   0. ]
        #                      [ 0.   0.   0.   0.   0. ]
        #                      [ 0.   0.   0.   0.1  0. ]
        #                      [ 0.   0.   0.   0.   0. ]
        #                      [ 0.   0.   0.   0.   0.1]]
        #                     [[ 0.1  0.2  0.5  0.   0. ]
        #                      [ 0.   0.   0.   0.1  0. ]
        #                      [ 0.   0.   0.   0.   0.1]
        #                      [ 0.   0.   0.   0.   0. ]
        #                      [ 0.   0.   0.   0.   0. ]]]]`.
        inner_prod = clipped_quotient * weights
        # Ex: `projection = [[ 0.8  0.0  0.1  0.0  0.1]
        #                    [ 0.8  0.1  0.1  0.0  0.0]]`.
        projection = tf.reduce_sum(inner_prod, 3)
        projection = tf.reshape(projection, [batch_size, num_dims])
        return projection
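As a sanity check, feeding the docstring's running example through the function reproduces the projection given there:

import tensorflow as tf

supports = tf.constant([[0., 2., 4., 6., 8.],
                        [1., 3., 4., 5., 6.]])
weights = tf.constant([[0.1, 0.6, 0.1, 0.1, 0.1],
                       [0.1, 0.2, 0.5, 0.1, 0.1]])
target_support = tf.constant([4., 5., 6., 7., 8.])
projection = project_distribution(supports, weights, target_support)
with tf.Session() as sess:
    print(sess.run(projection))
# [[0.8 0.  0.1 0.  0.1]
#  [0.8 0.1 0.1 0.  0. ]]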
Thank you in advance!