I've been tracking down a segfault in TensorFlow. The issue can be reproduced with the following snippet:

import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
    out = tf.layers.batch_normalization(out, training=True)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
    # tf.layers.batch_normalization registers its moving-average updates in
    # the UPDATE_OPS collection; per the documentation, the train op must be
    # made to depend on them so they run at each training step.
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        out = optimiser.minimize(out, global_step=tf.Variable(0, dtype=tf.float32), name='train_op')

config = tf.ConfigProto(allow_soft_placement=False)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

sample_in = [[[0]]]
sess.run(out, feed_dict={xin: sample_in})

I've managed to track down the issue and have opened a pull request for it on GitHub. Running the same code with my patch applied yields the following error message instead of the segfault:

2018-04-03 13:09:24.326950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2018-04-03 13:09:24.326982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-03 13:09:24.512956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11366 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:65:00.0, compute capability: 6.1)
Traceback (most recent call last):
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Cycle detected when adding enter->frame edge: Edge from gradients/f_count to (null) would create a cycle.
+-> (null)
|   rnn/TensorArrayStack/TensorArrayGatherV3
|   rnn/transpose_1
|   batch_normalization/moments/mean
|   batch_normalization/moments/Squeeze
|   batch_normalization/AssignMovingAvg/sub
|   batch_normalization/AssignMovingAvg/mul
|   batch_normalization/AssignMovingAvg
+-- gradients/f_count


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "breakage.py", line 21, in <module>
    sess.run(out, feed_dict={xin: sample_in})
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Cycle detected when adding enter->frame edge: Edge from gradients/f_count to (null) would create a cycle.
+-> (null)
|   rnn/TensorArrayStack/TensorArrayGatherV3
|   rnn/transpose_1
|   batch_normalization/moments/mean
|   batch_normalization/moments/Squeeze
|   batch_normalization/AssignMovingAvg/sub
|   batch_normalization/AssignMovingAvg/mul
|   batch_normalization/AssignMovingAvg
+-- gradients/f_count

This seems to indicate a topological problem in the graph my sample code builds. The issue appears whenever I combine any kind of RNN, batch normalisation, and the control dependency that batch normalisation requires for its moving-average updates.
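
For context, the control dependency in my snippet follows the pattern prescribed by the tf.layers.batch_normalization documentation. Here is a minimal, self-contained sketch of that documented pattern on a plain dense layer (no RNN; the layer sizes and loss here are placeholders of my own):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 4])
h = tf.layers.batch_normalization(tf.layers.dense(x, 4), training=True)
loss = tf.reduce_mean(tf.square(h))

# batch_normalization registers its moving-average update ops in
# tf.GraphKeys.UPDATE_OPS; the documented pattern makes the train op
# depend on them so they run on every training step.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(.0001).minimize(loss)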

Batch normalisation note

I've managed to mitigate the issue by using tf.contrib.layers.batch_norm instead and setting its updates_collections parameter to None, which inlines the update operations rather than deferring them to the UPDATE_OPS collection.

For reference, here is the updated code sample:

import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
    # updates_collections=None makes batch_norm run its moving-average
    # updates in place (as control dependencies of its output), so no
    # separate control dependency on UPDATE_OPS is needed.
    out = tf.contrib.layers.batch_norm(out, is_training=True, updates_collections=None)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
    out = optimiser.minimize(out, global_step=tf.Variable(0, dtype=tf.float32), name='train_op')

config = tf.ConfigProto(allow_soft_placement=False)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

sample_in = [[[0]]]
sess.run(out, feed_dict={xin: sample_in})

According to the documentation, forcing the updates in place like this may adversely affect performance, and it's still not clear to me what I was doing wrong in the first place. Does my original code look correct?

Also note that this issue only arises if TensorFlow is built with XLA JIT support, which makes me think it might be a bug in TensorFlow.
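
For anyone trying to reproduce this: beyond building with XLA support, JIT compilation can also be toggled per session through ConfigProto. The snippet below is only a sketch of that toggle (it is not part of my repro; on builds compiled without XLA it has no effect):

import tensorflow as tf

config = tf.ConfigProto()
# Enable global XLA JIT compilation for this session; this setting is
# honoured only by builds compiled with XLA support.
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
sess = tf.Session(config=config)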

EDIT: I have also filed an issue on GitHub.
