
Recently, I have been working on a project: predicting future trajectories of objects from their past trajectories using LSTMs in TensorFlow. (Here, a trajectory means a sequence of 2D positions.)

The input to the LSTM is, of course, the past trajectories, and the output is the future trajectories.

The mini-batch size is fixed during training, but the number of past trajectories in a mini-batch can vary. For example, let the mini-batch size be 10. If I have only 4 past trajectories for the current training iteration, the remaining 6 of the 10 slots are padded with zeros, as sketched below.
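(For illustration, the padding step looks roughly like this; the array names and sizes are made up for this sketch, not my actual code.)

import numpy as np

# Illustrative padding sketch (names/sizes are placeholders, not my real code).
# Each real past trajectory has shape [seq_length, 2]; the mini-batch is
# zero-padded up to the fixed batch size, and a mask marks the real slots.
batch_size, seq_length = 10, 8
real_trajs = [np.random.rand(seq_length, 2) for _ in range(4)]  # 4 real trajectories

batch = np.zeros((batch_size, seq_length, 2), dtype=np.float32)
loss_mask = np.zeros(batch_size, dtype=np.float32)
for i, traj in enumerate(real_trajs):
    batch[i] = traj      # copy the real trajectory into slot i
    loss_mask[i] = 1.0   # 1.0 = real trajectory, 0.0 = zero-padded slot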

When calculating the loss for back-propagation, I set the loss of the 6 padded entries to zero so that only the 4 real trajectories contribute to the back-propagation.

The problem I am concerned about is that TensorFlow still seems to compute gradients for the 6 padded entries even though their loss is zero. As a result, training becomes slower as I increase the mini-batch size, even though I use the same training data.

I also tried the tf.where function when calculating the loss, but the training time did not decrease.
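(Roughly, the two masking variants I tried look like this; the placeholder names below are illustrative, not my actual code.)

import tensorflow as tf

# Sketch of the two masking variants (TF 1.x; names are placeholders).
pred = tf.placeholder(tf.float32, [None, 2])    # predicted 2D positions
gt = tf.placeholder(tf.float32, [None, 2])      # ground-truth 2D positions
loss_mask = tf.placeholder(tf.float32, [None])  # 1.0 = real, 0.0 = padded

per_example_loss = tf.reduce_sum(tf.square(pred - gt), axis=1)

# Variant 1: multiply the per-example loss by the mask.
masked_loss_mul = per_example_loss * loss_mask

# Variant 2: select with tf.where.
masked_loss_where = tf.where(loss_mask > 0.5,
                             per_example_loss,
                             tf.zeros_like(per_example_loss))

total_loss = tf.reduce_sum(masked_loss_mul)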

How can I reduce the training time?

Here is my pseudo code for training.

# For each frame in a sequence
for f in range(pred_length):

    # For each element in a batch
    for b in range(batch_size):


        with tf.variable_scope("rnnlm") as scope:
            if (f > 0 or b > 0):
                scope.reuse_variables()

            # for each pedestrian in an element
            for p in range(MNP):

                # ground-truth position
                cur_gt_pose_dec = ...

                # loss mask
                loss_mask_ped = ... # '1' or '0'

                # go through RNN decoder
                output_states_dec_list[b][p], zero_states_dec_list[b][p] = cell_dec(cur_embed_frm_dec,
                                                                                    zero_states_dec_list[b][p])

                # fully connected layer for output
                cur_pred_pose_dec = tf.nn.xw_plus_b(output_states_dec_list[b][p], output_wd, output_bd)

                # go through embedding function for the next input
                prev_embed_frms_dec_list[b][p] = tf.reshape(tf.nn.relu(tf.nn.xw_plus_b(cur_pred_pose_dec, embedding_wd, embedding_bd)), shape=(1, rnn_size))

                # calculate MSE loss
                mse_loss = tf.reduce_sum(tf.pow(tf.subtract(cur_pred_pose_dec, cur_gt_pose_dec), 2.0))

                # only valid ped's traj contributes to the loss
                self.loss += tf.multiply(mse_loss, loss_mask_ped)
– cdsjjav
  • What are your hyperparameters? – prosti Dec 12 '18 at 01:33
  • pred_length = 12, batch_size = 1, MNP = 40, rnn_size = 32. Tell me if you want to know more details. Thank you :) – cdsjjav Dec 12 '18 at 01:43
  • Batch size should also be considered a hyperparameter, along with the RNN size, and (maybe I am wrong) shouldn't your LSTM also have the number of epochs as a parameter? – prosti Dec 12 '18 at 01:54
  • Can you comment on the loss function? Is the "MSE loss" from your pseudo code a custom loss function? – prosti Dec 12 '18 at 02:09
  • The input is a numpy array of size [batch_size x MNP x seq_length x 2]. In my question, I said the mini-batch size is set to 10 and the number of past trajectories included in the mini-batch can vary. In my pseudo code, MNP (maximum number of pedestrians) is fixed at 40, but the number of pedestrians included in the mini-batch can be smaller. **The mini-batch size mentioned in my question** corresponds to **MNP in the pseudo code**. (Sorry for confusing you.) Even if I increase MNP, the training result will be the same because the number of **real** pedestrian trajectories is fixed. – cdsjjav Dec 12 '18 at 02:14
  • Yes, the loss is a custom loss function: just the Euclidean distance between two 2D positions. – cdsjjav Dec 12 '18 at 02:16
  • Have you checked these before? https://stats.stackexchange.com/questions/286713/optimum-number-of-epochs-and-neurons-for-an-lstm-network and https://ai.stackexchange.com/questions/3156/how-to-select-number-of-hidden-layers-and-number-of-memory-cells-in-lstm. These are two alternative AI sites. – prosti Dec 12 '18 at 02:22
  • Thank you for your comments!! I have read that post already. My concern is about **how to speed up the training time under a fixed number of epochs and iterations per epoch**. But your comment is also helpful, because the training may require fewer epochs if I set all the hyper-parameters correctly!! – cdsjjav Dec 12 '18 at 02:35

1 Answer


I think you're looking for the function tf.stop_gradient. Using this, you could do something like tf.where(loss_mask, tensor, tf.stop_gradient(tensor)) to achieve the desired result, assuming that the dimensions are correct.
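A minimal sketch of that pattern (the tensor names here are placeholders, assuming a boolean mask with one entry per batch element):

import tensorflow as tf

# Sketch of tf.where + tf.stop_gradient (TF 1.x; names are placeholders).
per_example_loss = tf.placeholder(tf.float32, [None])  # unmasked loss per element
loss_mask = tf.placeholder(tf.bool, [None])            # True = real trajectory

# The padded entries still appear in the forward value, but tf.stop_gradient
# keeps gradients from flowing through them during back-propagation.
guarded_loss = tf.where(loss_mask,
                        per_example_loss,
                        tf.stop_gradient(per_example_loss))
total_loss = tf.reduce_sum(guarded_loss)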

However, it looks like this is probably not your issue. It seems as though you are defining new graph nodes for each item in your dataset. This is not how TensorFlow is meant to be used: you should have only one graph, built beforehand, that performs a fixed computation regardless of the batch size. You should definitely not define new nodes for every element in the batch, since that cannot efficiently take advantage of parallelism.
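For example, a rough sketch of the "one graph for the whole batch" idea might look like the following (TF 1.x; all names, shapes, and the simplified decoder here are illustrative, not your actual model):

import tensorflow as tf

# One graph handles the whole padded batch at once.
rnn_size, obs_length, pred_length = 32, 8, 12

past = tf.placeholder(tf.float32, [None, obs_length, 2])     # past trajectories
future = tf.placeholder(tf.float32, [None, pred_length, 2])  # ground truth
mask = tf.placeholder(tf.float32, [None])                    # 1.0 real / 0.0 padded

cell = tf.nn.rnn_cell.LSTMCell(rnn_size)
# Encode all past trajectories in the batch with a single dynamic_rnn call.
_, enc_state = tf.nn.dynamic_rnn(cell, past, dtype=tf.float32, scope="encoder")

# (A real decoder would feed its own predictions back in step by step; for
# brevity this sketch projects the encoder state to all future positions at once.)
pred = tf.layers.dense(enc_state.h, pred_length * 2)
pred = tf.reshape(pred, [-1, pred_length, 2])

# Masked MSE: padded batch entries contribute nothing to the loss or gradients.
per_traj_loss = tf.reduce_sum(tf.square(pred - future), axis=[1, 2])
loss = tf.reduce_sum(per_traj_loss * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)

With a single vectorized graph like this, increasing the batch size mostly adds parallel work inside one op call instead of adding more graph nodes, which is where your slowdown appears to come from.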

– Cory Nezin