I've been tracking down a segfault in TensorFlow. The issue can be reproduced with the following snippet:
import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
    out = tf.layers.batch_normalization(out, training=True)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
    # tf.layers.batch_normalization requires its moving-average update
    # ops to be run explicitly, hence the control dependency below.
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        out = optimiser.minimize(out, global_step=tf.Variable(0, dtype=tf.float32), name='train_op')

config = tf.ConfigProto(allow_soft_placement=False)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

sample_in = [[[0]]]
sess.run(out, feed_dict={xin: sample_in})
I've managed to track down the issue and have a pull request for it on GitHub. Running the snippet with my patch applied yields the following error message instead of the segfault:
2018-04-03 13:09:24.326950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2018-04-03 13:09:24.326982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-03 13:09:24.512956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11366 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:65:00.0, compute capability: 6.1)
Traceback (most recent call last):
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Cycle detected when adding enter->frame edge: Edge from gradients/f_count to (null) would create a cycle.
+-> (null)
| rnn/TensorArrayStack/TensorArrayGatherV3
| rnn/transpose_1
| batch_normalization/moments/mean
| batch_normalization/moments/Squeeze
| batch_normalization/AssignMovingAvg/sub
| batch_normalization/AssignMovingAvg/mul
| batch_normalization/AssignMovingAvg
+-- gradients/f_count
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "breakage.py", line 21, in <module>
    sess.run(out, feed_dict={xin: sample_in})
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Cycle detected when adding enter->frame edge: Edge from gradients/f_count to (null) would create a cycle.
+-> (null)
| rnn/TensorArrayStack/TensorArrayGatherV3
| rnn/transpose_1
| batch_normalization/moments/mean
| batch_normalization/moments/Squeeze
| batch_normalization/AssignMovingAvg/sub
| batch_normalization/AssignMovingAvg/mul
| batch_normalization/AssignMovingAvg
+-- gradients/f_count
This seems to indicate a topological issue in my graph. The problem appears whenever I combine any kind of RNN with batch normalisation and the control dependency required to run its moving-average updates.
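As a sanity check, one can inspect what the UPDATE_OPS collection actually contains once the first snippet's graph is built; I would expect it to list the batch_normalization/AssignMovingAvg ops named in the cycle above:

# Debugging sketch, not part of the repro: run after building the
# graph from the first snippet.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
for op in update_ops:
    # Should print the batch_normalization/AssignMovingAvg ops that
    # the control dependency pulls into the train op.
    print(op.name)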
I've managed to mitigate the issue by relying on tf.contrib.layers.batch_norm instead, setting its updates_collections parameter to None so that the update operation is inlined.
For reference, here is the updated code sample:
import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
    # updates_collections=None inlines the moving-average updates,
    # so no separate control dependency is needed.
    out = tf.contrib.layers.batch_norm(out, is_training=True, updates_collections=None)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
    out = optimiser.minimize(out, global_step=tf.Variable(0, dtype=tf.float32), name='train_op')

config = tf.ConfigProto(allow_soft_placement=False)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

sample_in = [[[0]]]
sess.run(out, feed_dict={xin: sample_in})
According to the documentation, this may adversely affect performance, and it's not clear to me what I'm doing wrong in the first place. Does my code look correct?
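One alternative I have considered, though I have not verified that it sidesteps the cycle, is to run the update ops by grouping them with the training op via tf.group instead of wrapping minimize in a control_dependencies block:

# Untested sketch: reuses `optimiser` and `out` from the first snippet.
train_op = optimiser.minimize(out, name='train_op')
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
# Group the moving-average updates with the gradient step, so that
# sess.run(train_op) executes both.
train_op = tf.group(train_op, *update_ops)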
Also note that this issue only arises if TensorFlow is built with XLA JIT support, which makes me think it might be a bug in TensorFlow.
EDIT: I have also filed an issue on GitHub.