
In most TensorFlow tutorials, authors use channel-last dimension ordering, e.g.

input_layer = tf.reshape(features, [-1, 28, 28, 1])

where the last digit represents the number of channels (https://www.tensorflow.org/tutorials/layers). Being used to Theano and Numpy (both use C ordering, i.e. row-major), I find this awkward. Moreover, having read the documentation on in-memory layout schemes in TensorFlow, I reckon the channel-last layout will cause more cache misses, because convolutions are carried out on individual channels, while in channel-last ordering these channels are interleaved in linear memory, effectively shrinking the cache by a factor of N (where N is the number of channels), which is especially inefficient in 3D and 4D convolutions. Am I getting something wrong?
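
To make my concern concrete, here is a small NumPy sketch (my own illustration, not from the tutorials; NumPy arrays are row-major by default) showing how the two layouts place channel values in memory:

import numpy as np

nhwc = np.zeros((32, 28, 28, 3), dtype=np.float32)   # channels-last
nchw = np.zeros((32, 3, 28, 28), dtype=np.float32)   # channels-first

# In NHWC the channel index varies fastest: the C values of a single pixel
# sit next to each other, so neighbouring pixels of one channel are 12 bytes apart.
print(nhwc.strides)   # (9408, 336, 12, 4)
# In NCHW each channel is a contiguous 28 x 28 plane of floats.
print(nchw.strides)   # (9408, 3136, 112, 4)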

P.S.

I've found a closely related thread (Tensorflow 3 channel order of color inputs). The author of the accepted answer states that TF uses row-major by default, but given that all of the tutorials I've found so far show channel-last ordering, I find that claim misleading.

Eli Korvigo

2 Answers


Here's the explanation:

https://www.tensorflow.org/performance/performance_guide#use_nchw_image_data_format

Image data format refers to the representation of batches of images. TensorFlow supports NHWC (TensorFlow default) and NCHW (cuDNN default). N refers to the number of images in a batch, H refers to the number of pixels in the vertical dimension, W refers to the number of pixels in the horizontal dimension, and C refers to the channels (e.g. 1 for black and white, 3 for RGB, etc.) Although cuDNN can operate on both formats, it is faster to operate in its default format.

The best practice is to build models that work with both NCHW and NHWC as it is common to train using NCHW on GPU, and then do inference with NHWC on CPU.

The very brief history of these two formats is that TensorFlow started by using NHWC because it was a little faster on CPUs. Then the TensorFlow team discovered that NCHW performs better when using the NVIDIA cuDNN library. The current recommendation is that users support both formats in their models. In the long term, we plan to rewrite graphs to make switching between the formats transparent.
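
For instance (my own sketch, not from the guide), a TF 1.x model can take the layout as a parameter, so the same code can train in NCHW on a GPU and run inference in NHWC on a CPU:

import tensorflow as tf

def simple_cnn(images, data_format='channels_last'):
    # `images` is assumed to arrive as NHWC; transpose once if the model
    # is to run in NCHW ('channels_first').
    if data_format == 'channels_first':
        images = tf.transpose(images, [0, 3, 1, 2])  # NHWC -> NCHW
    x = tf.layers.conv2d(images, 32, 3, padding='same',
                         activation=tf.nn.relu, data_format=data_format)
    x = tf.layers.max_pooling2d(x, 2, 2, data_format=data_format)
    return tf.layers.flatten(x)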

Moreover, digging into the code, we can see here that when the input is in NHWC format, TensorFlow converts it to NCHW for you:

  if (data_format == FORMAT_NHWC) {
    // Convert the input tensor from NHWC to NCHW.
    TensorShape nchw_shape =
        ShapeFromFormat(FORMAT_NCHW, in_batch, in_rows, in_cols, in_depths);
    if (in_depths > 1) {
      Tensor transformed_input;
      OP_REQUIRES_OK(ctx, ctx->allocate_temp(DataTypeToEnum<T>::value,
                                             nchw_shape, &transformed_input));
      functor::NHWCToNCHW<GPUDevice, T, 4>()(
          ctx->eigen_device<GPUDevice>(),
          const_cast<const Tensor&>(input).tensor<T, 4>(),
          transformed_input.tensor<T, 4>());
      input = transformed_input;
    } else {
      // If depth <= 1, then just reshape.
      CHECK(input.CopyFrom(input, nchw_shape));
    }
  }

You can specify the data format you want to use for every operation, but TensorFlow by default uses NHWC, not NCHW. That's why even the TF developers still use NHWC: it avoids having to specify the format in every operation.
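
As a minimal sketch of what that per-operation choice looks like (TF 1.x API; the filter shape [H, W, in_channels, out_channels] is the same in both cases):

import tensorflow as tf

x_nhwc = tf.random_normal([8, 28, 28, 3])
w = tf.random_normal([3, 3, 3, 16])

# Default: NHWC, nothing to specify.
y_nhwc = tf.nn.conv2d(x_nhwc, w, strides=[1, 1, 1, 1], padding='SAME')

# NCHW: the input layout, the strides and data_format all have to agree,
# and this path generally needs a GPU/cuDNN kernel.
x_nchw = tf.transpose(x_nhwc, [0, 3, 1, 2])
y_nchw = tf.nn.conv2d(x_nchw, w, strides=[1, 1, 1, 1], padding='SAME',
                      data_format='NCHW')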

nessuno
  • Does it mean that TF converts the representation to row-major when it sends the data to a GPU? P.S. I'm not the down-voter. – Eli Korvigo Jun 27 '17 at 07:31
  • This documentation confuses me, because in "Best practices" it recommends to "Use NCHW image data format", yet TF developers don't follow this recommendation in their own tutorials. – Eli Korvigo Jun 27 '17 at 07:36
  • Don't worry about the downvote, it happens. However, look [here](https://github.com/tensorflow/tensorflow/blob/0be81439c91e297b078152dd0c266471b24bde7f/tensorflow/core/kernels/conv_ops.cc#L558-L575): if the format is NHWC, then TensorFlow converts it to NCHW for you. You can specify the data format you want to use for every operation, but TensorFlow by default uses NHWC, not NCHW; that's why even the TF developers still use NHWC, to avoid specifying the format in every operation. – nessuno Jun 27 '17 at 07:39
  • I guess you can add this comment to the answer. – Eli Korvigo Jun 27 '17 at 07:44
  • @nessuno I had trained a model on `GPU` in `NCHW` format. When I tried running this on `CPU` it threw `Default MaxPoolingOp only supports NHWC. [[Node: max_pooling2d/MaxPool = MaxPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="SAME", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:CPU:0"](initial_conv)]]`. I can run this model on `GPU` but not on `CPU`, do you know how to solve this? – Effective_cellist Mar 02 '18 at 09:10
  • You're facing this issue: https://github.com/tensorflow/tensorflow/issues/2660 . The maxpool operation you have in your graph has the order saved in its definition, and therefore when you try to change the device you encounter this error. IMHO you have to load the trained model on GPU, perform network surgery changing the maxpool node in order to use the NHWC format, and save the model again. – nessuno Mar 02 '18 at 09:24
  • @nessuno Thank you for your comment. Could you please help me understand how to change the order of the max pool node to NHWC format? – Effective_cellist Mar 02 '18 at 09:50
  • Define your model in Python. Use a saver to restore the model from the checkpoint. Define a new model in Python with the desired data format for the maxpool ( https://www.tensorflow.org/api_docs/python/tf/nn/max_pool ). Copy the weights from the old model to the new one. Save the new model. More or less this is the way to go (a rough sketch follows these comments). – nessuno Mar 02 '18 at 09:56
  • Nice clarification! But the OP had a good point that NHWC seems not to be friendly to cache misses; how did TF implement it to obtain good performance on CPU devices, though? – galactica Jun 11 '18 at 20:41
  • I'm a little confused when you say "format NHWC, tensorflow converts it for you to NCHW." If TensorFlow converts it for you, does it really matter? Is the only issue the overhead of TensorFlow converting the data from NHWC to NCHW? Also, when you say TensorFlow converts it for you, is this conversion done before the data is sent to the GPU, or is the data sent to the GPU and then converted from NHWC to NCHW? – user3731622 Feb 01 '19 at 18:37
  • It matters because the conversion is made by TensorFlow on the CPU, before sending it to the GPU (at least, from my understanding of the above snippet, which seems to use Eigen and not CUDA). – nessuno Feb 01 '19 at 20:07
  • Cited link for "performance_guide" is now a 404. – escape-llc Mar 01 '21 at 11:59
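
A rough sketch (my own, TF 1.x) of the checkpoint surgery nessuno describes above; 'ckpt_nchw' and 'ckpt_nhwc' are hypothetical checkpoint paths and the tiny model stands in for the real one:

import tensorflow as tf

def build_model(data_format):
    # Variable names must match the trained NCHW graph so the checkpointed
    # weights map onto the new graph directly (conv filters have the same
    # HWIO shape in both layouts).
    shape = [None, 28, 28, 3] if data_format == 'channels_last' else [None, 3, 28, 28]
    x = tf.placeholder(tf.float32, shape, name='input')
    h = tf.layers.conv2d(x, 16, 3, padding='same', name='initial_conv',
                         data_format=data_format)
    return tf.layers.max_pooling2d(h, 3, 2, padding='same',
                                   name='max_pooling2d',
                                   data_format=data_format)

build_model('channels_last')          # rebuild the graph with NHWC ops
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, 'ckpt_nchw')  # weights trained with the NCHW graph
    saver.save(sess, 'ckpt_nhwc')     # re-saved with NHWC pooling/conv ops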

Your question is based on a misunderstanding.

There is no contradiction between row-major and NHWC. Row-major means that the rightmost index is the one that causes the smallest jumps in memory when it changes, and changes in the leftmost index cause the biggest jumps. In row-major, the last dimension is contiguous; in column-major, the first one is. See https://en.wikipedia.org/wiki/Row-_and_column-major_order#Address_calculation_in_general for how to calculate memory offsets for an arbitrary number of dimensions.
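
As a quick illustration (my own, not from the linked article), the row-major offset formulas for the two layouts can be written down and checked against NumPy, which also uses row-major (C) order by default:

import numpy as np

N, H, W, C = 8, 28, 28, 3

def nhwc_offset(n, h, w, c):
    return ((n * H + h) * W + w) * C + c   # row-major: c is contiguous

def nchw_offset(n, c, h, w):
    return ((n * C + c) * H + h) * W + w   # row-major: w is contiguous

assert nhwc_offset(2, 5, 7, 1) == np.ravel_multi_index((2, 5, 7, 1), (N, H, W, C))
assert nchw_offset(2, 1, 5, 7) == np.ravel_multi_index((2, 1, 5, 7), (N, C, H, W))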

So, TF's memory IS laid out in row-major. The differences are only in the order of the indices (some people even prefer CHWN - see https://github.com/soumith/convnet-benchmarks/issues/66#issuecomment-155944875). NCHW is popular because it's what cuDNN does best, but basically every common memory layout in DL is row-major.

etarion
  • My confusion comes from the fact that in channel-last representations you basically get an H x W grid of C-dimensional vectors instead of C grids of shape H x W, which look more natural to me for row-major layouts. Anyway, thank you for correcting me. – Eli Korvigo Jun 27 '17 at 08:00