
In reinforcement learning one usually runs a forward pass of the neural network at each step of the episode in order to compute the policy, and afterwards computes the parameter gradients via backpropagation. A simplified version of my network looks like this:

import numpy as np
import tensorflow as tf
import tensorflow.contrib.slim as slim


class AC_Network(object):

    def __init__(self, s_size, a_size, scope, trainer, parameters_net):
        with tf.variable_scope(scope):
            self.is_training = tf.placeholder(shape=[], dtype=tf.bool)
            self.inputs = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
            # (...)
            # layer_size and policy_loss_multiplier come from the part omitted above
            layer = slim.fully_connected(self.inputs,
                                         layer_size,
                                         activation_fn=tf.nn.relu,
                                         biases_initializer=None)
            layer = tf.contrib.layers.dropout(inputs=layer,
                                              keep_prob=parameters_net["dropout_keep_prob"],
                                              is_training=self.is_training)

            self.policy = slim.fully_connected(layer, a_size,
                                               activation_fn=tf.nn.softmax,
                                               biases_initializer=None)

            self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
            self.advantages = tf.placeholder(shape=[None], dtype=tf.float32)
            actions_onehot = tf.one_hot(self.actions, a_size, dtype=tf.float32)
            responsible_outputs = tf.reduce_sum(self.policy * actions_onehot, [1])
            self.policy_loss = -policy_loss_multiplier * tf.reduce_mean(tf.log(responsible_outputs) * self.advantages)

            local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
            self.gradients = tf.gradients(self.policy_loss, local_vars)

Now, during training, I first roll out the episode with consecutive forward passes (again, a simplified version):

s = self.local_env.reset()  # list of input variables for the first step
done = False
while not done:
    a_dist = sess.run(self.local_AC.policy,
                      feed_dict={self.local_AC.inputs: [s],
                                 self.local_AC.is_training: True})
    a = np.argmax(a_dist)
    s, r, done, extra_stat = self.local_env.step(a)
    # (...)

and at the end I calculate the gradients with a backward pass:

p_l, grad = sess.run([self.local_AC.policy_loss,
                      self.local_AC.gradients],
                     feed_dict={self.local_AC.inputs: np.vstack(comb_observations),
                                self.local_AC.is_training: True,
                                self.local_AC.actions: np.hstack(comb_actions),
                                # advantages collected during the rollout (omitted above)
                                self.local_AC.advantages: np.hstack(comb_advantages)})

(Please note that I may have made a mistake somewhere above while stripping out the parts of the original code that are irrelevant to the issue in question.)
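
For completeness: the gradients computed this way would then be applied through the trainer passed into the constructor. The snippet below is only a rough sketch of that step, with illustrative code rather than my exact implementation (in a typical A3C setup the gradients would be applied to the global network's variables, which I omit here):

# illustrative sketch: inside __init__, right after self.gradients is defined,
# build an op that applies those gradients via the trainer argument
self.apply_grads = trainer.apply_gradients(zip(self.gradients, local_vars))

During training one would then run this op with the same feed_dict as in the gradient computation above, e.g. sess.run(self.local_AC.apply_grads, feed_dict=...).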

So, finally, the question: is there a way to ensure that all the consecutive calls to sess.run() generate the same dropout structure? Ideally I would like to have exactly the same dropout structure within each episode and only change it between episodes. Things seem to work well as they are, but I keep wondering.
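
One workaround I can think of, although I am not sure it is idiomatic, would be to replace tf.contrib.layers.dropout with an explicit binary mask that is sampled once per episode with numpy and fed into every sess.run() of that episode (the dropout_mask placeholder and the episode_mask array below are my own additions, not part of the code above):

# in __init__: an explicit mask placeholder instead of tf.contrib.layers.dropout
# (shape [layer_size] broadcasts over the batch dimension)
self.dropout_mask = tf.placeholder(shape=[layer_size], dtype=tf.float32)
layer = layer * self.dropout_mask / parameters_net["dropout_keep_prob"]  # inverted dropout scaling

# once per episode, before the rollout:
keep_prob = parameters_net["dropout_keep_prob"]
episode_mask = (np.random.rand(layer_size) < keep_prob).astype(np.float32)

# every forward pass of the episode and the final backward pass then feed the
# same mask, e.g. feed_dict={self.local_AC.inputs: [s],
#                            self.local_AC.dropout_mask: episode_mask}

At evaluation time a mask of np.ones(layer_size) would reproduce the deterministic behaviour of is_training=False. Still, I would prefer to know whether this can be achieved with the built-in dropout, hence the question.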

pegazik
