I believe I understand the differences between NHWC and NCHW.
NHWC (batch, height, width, channels): assuming 3 input channels representing RGB, the output index order is R0, G0, B0, R1, G1, B1, ..., Rc, Gc, Bc.
NCHW (batch, channels, height, width): assuming 3 input channels representing RGB, the output index order is R0, R1, ..., Rc, G0, G1, ..., Gc, B0, B1, ..., Bc.
(Is this the only difference?)
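To make the two orders concrete, here is a minimal sketch in plain Python of how I understand the flat memory offsets for each layout. The function names (`nhwc_offset`, `nchw_offset`, `flat_order`) are my own for illustration and do not come from any framework; the labels use the pixel index (row-major over height and width) as the subscript, and I assume a single image (n = 0).

```python
def nhwc_offset(n, h, w, c, H, W, C):
    # Channels vary fastest: all channels of one pixel are adjacent.
    return ((n * H + h) * W + w) * C + c


def nchw_offset(n, c, h, w, C, H, W):
    # Width varies fastest: each channel is stored as a full H x W plane.
    return ((n * C + c) * H + h) * W + w


def flat_order(layout, H, W, channels="RGB"):
    """Return the labels of a single H x W image in flat memory order."""
    C = len(channels)
    flat = [None] * (H * W * C)
    for h in range(H):
        for w in range(W):
            pixel = h * W + w  # row-major pixel index used as the subscript
            for c, name in enumerate(channels):
                if layout == "NHWC":
                    i = nhwc_offset(0, h, w, c, H, W, C)
                elif layout == "NCHW":
                    i = nchw_offset(0, c, h, w, C, H, W)
                else:
                    raise ValueError(f"unknown layout: {layout}")
                flat[i] = f"{name}{pixel}"
    return flat


nhwc = flat_order("NHWC", 2, 2)
nchw = flat_order("NCHW", 2, 2)
print(" ".join(nhwc))  # R0 G0 B0 R1 G1 B1 R2 G2 B2 R3 G3 B3
print(" ".join(nchw))  # R0 R1 R2 R3 G0 G1 G2 G3 B0 B1 B2 B3
```

If this sketch is right, the same values are stored in both cases and only the stride of each dimension changes: in NHWC the channel stride is 1, while in NCHW the width stride is 1.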
What I am wondering is why NCHW is better/faster when it comes to GPU performance.