
I see there are two types of data layouts in the world of convolutional networks: channel-first and channel-last.

According to many websites, "channel-first" refers to the NCHW format, while "channel-last" is equivalent to NHWC. This is clear, because in the channel-first format C is positioned before H and W. However, ARM seems to define "channel-first" as HWC (i.e. NHWC), as you can see in this paper:

P6: The two most common image data formats are Channel-Width-Height (CHW), i.e. channel last, and Height-Width-Channel (HWC), i.e. channel first. The dimension ordering is the same as that of the data stride. In an HWC format, the data along the channel is stored with a stride of 1, data along the width is stored with a stride of the channel count, and data along the height is stored with a stride of (channel count × image width).

This also seems reasonable, since "channel-first" sounds like the MAC operations go channel-wise, with the channel in the innermost loop, like below:

for (int n = 0; n < N; n++) {
  for (int h = 0; h < H; h++) {
    for (int w = 0; w < W; w++) {
      for (int c = 0; c < C; c++) {
        // MAC: the channel index c changes fastest (innermost loop)
      }
    }
  }
}

So there is no fixed definition of channel-first or channel-last, is there?

Also, when you say NHWC or NCHW, I'm not sure what you specifically mean. I guess the important thing is the combination of the algorithm and the data arrangement in memory: if the data comes in NHWC format, you need to design the algorithm accordingly.

And since there is no fixed definition of NHWC and NCHW, I don't think it makes any sense to just say PyTorch is NCHW, or channel-first, or something, without mentioning how the data is arranged in memory.

Or, when you hear NCHW, can you conclude that the data arrangement in memory is like ch0[0,0], ch1[0,0], ch2[0,0], ch0[1,0], ch1[1,0], ch2[1,0], ch0[2,0], ...?

Can anyone help clarify my understanding of the data format?


1 Answer


I had originally overlooked the paper you linked, where they clearly define the two terms in the opposite way to how they are usually employed in documentation and elsewhere. There are indeed two different ways to look at CHW and HWC...

TL;DR: For end-users, CHW is channel-first while HWC is channel-last. In this case, we refer to the position of the channel dimension with regard to the other dimensions (H and W): whether it comes before (CHW) or after (HWC) them is a matter of convention defined by the library used (e.g. PyTorch vs. TensorFlow). In terms of memory allocation, however, it makes sense to call CHW channel-last: the channel axis then has the largest stride, so it is unfolded last with regard to the other axes of the tensor.


I don't think it makes any sense to just say PyTorch is NCHW, or channel-first, or something, without mentioning how the data is arranged in memory.

For the end-user (as in end-developer), it does not matter how the memory is allocated or arranged. The important part is knowing how to use the API provided by PyTorch to manipulate torch.Tensors. When we say NCHW, we mean 'channel-first', i.e. tensors of shape (batch_size, channels, height, width). On every PyTorch documentation page, you will find the exact shapes that input and output tensors are required to have. It just happens that PyTorch has chosen to stick with the NCHW convention for tensors holding 2D spatial data with channels.
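
As a minimal sketch (the layer sizes here are just illustrative), a nn.Conv2d layer expects its input in that NCHW shape:

>>> import torch.nn as nn
>>> conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
>>> x = torch.rand(16, 3, 32, 32)   # NCHW: batch=16, channels=3, height=32, width=32
>>> conv(x).shape                   # the output keeps the NCHW ordering
torch.Size([16, 8, 32, 32])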

It makes sense to stick with a single format, be it for the underlying implementation - where the memory arrangement does matter - or for the end-user - who is used to working with one format.

In TensorFlow, for instance, the channel dimension comes last, so the format used is NHWC.
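
Again as a rough sketch (assuming the Keras defaults, where data_format is 'channels_last'), the equivalent convolution takes its input with the channels in the last position:

>>> import tensorflow as tf
>>> conv = tf.keras.layers.Conv2D(filters=8, kernel_size=3, padding='same')
>>> x = tf.random.uniform((16, 32, 32, 3))   # NHWC: batch=16, height=32, width=32, channels=3
>>> conv(x).shape                             # the output keeps the NHWC ordering
TensorShape([16, 32, 32, 8])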


To come back to how HWC (resp. CHW) is named channel-first (resp. channel-last) in the paper you linked: this has to do with the tensor stride, i.e. the layout of the data in memory. Intuitively, you can think of the HWC format as channel-first because the channel dimension is the first axis to get unfolded, i.e. the one with a stride of 1.

If you look at this example:

>>> x = torch.rand(2, 3, 4)  # HWC: the last dimension is the channel axis
>>> x
tensor([[[0.5567, 0.0276, 0.6491, 0.7933],
         [0.2876, 0.0361, 0.3883, 0.3201],
         [0.6742, 0.0305, 0.5719, 0.4683]],

        [[0.3385, 0.2082, 0.1675, 0.3429],
         [0.6146, 0.0533, 0.6147, 0.2216],
         [0.1855, 0.6107, 0.1716, 0.0071]]])

The underlying memory arrangement is actually revealed when flattening the data (assuming the initial tensor's data is contiguous in memory):

>>> x.flatten()
tensor([0.5567, 0.0276, 0.6491, 0.7933, 0.2876, 0.0361, 0.3883, 0.3201, 0.6742,
        0.0305, 0.5719, 0.4683, 0.3385, 0.2082, 0.1675, 0.3429, 0.6146, 0.0533,
        0.6147, 0.2216, 0.1855, 0.6107, 0.1716, 0.0071])

Notice above how the data is laid out: for each pixel, all of its channel values are stored together, 0.5567, 0.0276, 0.6491, 0.7933 for the first pixel, then 0.2876, 0.0361, 0.3883, 0.3201 for the next one, etc... In other words, the channel index varies fastest.

In the other format (i.e. CHW), the data would have been laid out channel plane by channel plane: 0.5567, 0.2876, 0.6742, then 0.3385, 0.6146, 0.1855 (all of channel 0 first), etc...
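
You can check this with the same x (a small sketch: permuting the axes to CHW and calling contiguous() reorders the underlying memory accordingly):

>>> x.permute(2, 0, 1).contiguous().flatten()  # (H, W, C) -> (C, H, W), then flatten
tensor([0.5567, 0.2876, 0.6742, 0.3385, 0.6146, 0.1855, 0.0276, 0.0361, 0.0305,
        0.2082, 0.0533, 0.6107, 0.6491, 0.3883, 0.5719, 0.1675, 0.6147, 0.1716,
        0.7933, 0.3201, 0.4683, 0.3429, 0.2216, 0.0071])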

So it does make sense to call CHW channel-last (and HWC channel-first) when referring to how the data is allocated in memory.
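
The stride itself tells the same story; as a last sketch (with arbitrary shapes, strides given in number of elements):

>>> chw = torch.rand(3, 4, 5)   # CHW: channel, height, width
>>> chw.stride()
(20, 5, 1)                      # the channel axis has the largest stride: it is unfolded last
>>> hwc = torch.rand(4, 5, 3)   # HWC: height, width, channel
>>> hwc.stride()
(15, 3, 1)                      # the channel axis has stride 1: it is unfolded first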

  • I can elaborate on the stride of the tensor if you require some more clarification. – Ivan Aug 03 '21 at 11:43
  • Thank you very much for your elaborate explanation, and I'm very sorry for not having noticed your answer for such a long time. I understand now: on the user's side, we can suppose that HWC is called channel-last and CHW channel-first. In a rare case (not sure if I can say 'rare' though), HWC is referred to as channel-first to better describe the memory allocation. It's a bit confusing for a beginner like me :( – MAMO Oct 31 '21 at 03:12