I tried to run the following graph (the graph that causes the error):

Unfortunately, I receive the following error message:

tensorflow.python.framework.errors.InternalError: Message length was negative
 [[Node: random_uniform_1_S1 = _Recv[client_terminated=false,
  recv_device= "/job:worker/replica:0/task:1/cpu:0",
  send_device="/job:worker/replica:0/task:0/cpu:0",
  send_device_incarnation=3959744268201087672,
  tensor_name="edge_18_random_uniform_1",
  tensor_type=DT_DOUBLE,
  _device="/job:worker/replica:0/task:1/cpu:0"]()]]

I noticed that this error does not occur when random_uniform_1 is 800MB, but it does occur when it is 8GB.

(Notice that random_uniform_1 has to be transferred from one device to another device.)

Question: Is there a limit on how big a tensor can be, if that tensor has to be transferred between devices?

Jonathan Holvey
PaulWen

1 Answer

Yes, currently there is a 2GB limit on an individual tensor when sending it between processes. This limit is imposed by the protocol buffer representation (more precisely, by the auto-generated C++ wrappers produced by the protoc compiler) that is used in TensorFlow's communication layer.

We are investigating ways to lift this restriction. In the meantime, you can work around it by manually adding tf.split() (or tf.slice()) and tf.concat() operations to partition the tensor for transfer. If you have very large tf.Variable objects, you can use variable partitioners to perform this transformation automatically. Note that in your program you have multiple 8GB tensors in memory at once, so the peak memory utilization will be at least 16GB.
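The split-and-reassemble workaround can be sketched as follows. This is a minimal illustration of the pattern rather than the actual graph construction: NumPy stands in for the tf.split()/tf.concat() ops, and the names split_for_transfer, reassemble, and CHUNK_LIMIT are made up for this sketch.

```python
import numpy as np

# Assumed per-message cap imposed by the protobuf serialization layer (2 GiB).
CHUNK_LIMIT = 2 * 1024**3

def split_for_transfer(tensor, limit=CHUNK_LIMIT):
    """Split a tensor along axis 0 into the smallest number of chunks
    such that each chunk's raw byte size stays under `limit`."""
    n_chunks = max(1, -(-tensor.nbytes // limit))  # ceiling division
    return np.array_split(tensor, n_chunks, axis=0)

def reassemble(chunks):
    """Inverse of split_for_transfer: concatenate chunks along axis 0."""
    return np.concatenate(chunks, axis=0)

# Small stand-in tensor; in the real graph each chunk would be sent
# across devices individually and concatenated on the receiving side.
x = np.random.uniform(size=(1000, 100))
parts = split_for_transfer(x)
y = reassemble(parts)
assert y.shape == x.shape
```

In a real TensorFlow graph you would place the tf.split() op on the sending device and the tf.concat() op on the receiving device, so that each Send/Recv edge carries one sub-2GB chunk.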

mrry