Some time ago (before CuDNN introduced its own RNN/LSTM implementation), you would use a tensor of shape [B,T,D] (batch-major) or [T,B,D] (time-major) and then have a straightforward LSTM implementation. Straightforward means e.g. pure Theano or pure TensorFlow.
It was (is?) common wisdom that time-major is more efficient for RNNs/LSTMs.
This might be due to the internal unrolling details of Theano/TensorFlow: e.g. in TensorFlow, you would use tf.TensorArray, which naturally unrolls over the first axis, so the input must be time-major (otherwise it would imply a transpose to time-major first); and not using tf.TensorArray but directly indexing the tensor would be extremely inefficient in the backprop phase.
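To make the tf.TensorArray point concrete, here is a minimal sketch (not actual framework code) of a manually unrolled plain tanh RNN; the weight parameters W, U, b and their shapes are just assumptions for illustration. TensorArray.unstack/stack split and join along the first axis, which is why time-major is the natural layout here:

```python
import tensorflow as tf

# Minimal sketch: manually unrolled tanh RNN over a time-major input x of shape [T, B, D].
# Hypothetical weights: W: [D, H], U: [H, H], b: [H].
def rnn_time_major(x, W, U, b):
    T = tf.shape(x)[0]
    xs = tf.TensorArray(x.dtype, size=T).unstack(x)  # splits along axis 0, i.e. time
    ys = tf.TensorArray(x.dtype, size=T)
    h = tf.zeros_like(tf.matmul(x[0], W))            # initial state, shape [B, H]
    for t in tf.range(T):
        h = tf.tanh(tf.matmul(xs.read(t), W) + tf.matmul(h, U) + b)
        ys = ys.write(t, h)
    return ys.stack()  # joins along axis 0 again -> [T, B, H], still time-major
```

With a batch-major input, you would first need a transpose to [T,B,D] before the unstack.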
But I think this is also related to memory locality: an RNN consumes one time step for the whole batch at a time, and in the time-major layout that [B,D] slice is contiguous in memory. So even with your own custom native implementation, where you have full control over these details (and thus could choose any format you like), time-major should be more efficient.
(Maybe someone can confirm this?)
(In a similar way, for convolutions, batch-channel major (NCHW) is also more efficient. See here.)
Then CuDNN introduced its own RNN/LSTM implementation, which used packed tensors, i.e. with all padding removed. Also, sequences must be sorted by sequence length (longest first). This is still time-major, just without the padded frames.
This caused some difficulty in adopting these kernels, because padded (non-packed) tensors were pretty standard in all frameworks up to that point: you need to sort by sequence length, pack, call the kernel, unpack, and undo the sorting. But slowly the frameworks adopted this.
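For illustration, this is roughly what that workflow looks like with PyTorch's packing utilities (shapes and lengths here are made up); with enforce_sorted=False the sorting and un-sorting is handled internally:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

T, B, D, H = 7, 4, 16, 32                      # assumed sizes, just for illustration
lengths = torch.tensor([5, 7, 2, 3])           # per-sequence lengths, not sorted
x = torch.randn(T, B, D)                       # padded, time-major

packed = pack_padded_sequence(x, lengths, enforce_sorted=False)  # sorts and drops padding
lstm = torch.nn.LSTM(D, H)                     # time-major (seq-first) by default
packed_out, _ = lstm(packed)                   # on GPU this can dispatch to the CuDNN kernel
out, out_lengths = pad_packed_sequence(packed_out, total_length=T)  # padded again, original order
```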
However, Nvidia then extended the CuDNN functions (e.g. cudnnRNNForwardTrainingEx, and later cudnnRNNForward), which now support all three formats (see the toy example after the list):
- CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED: Data layout is padded, with outer stride from one time step to the next (time-major, or sequence-major).
- CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED: Data layout is padded, with outer stride from one batch to the next (batch-major).
- CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_PACKED: The sequence lengths are sorted and packed as in the basic RNN API (time-major without padding frames, i.e. packed).
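To make the three layouts concrete, here is a toy NumPy example (sizes are made up) showing the same batch of sequences in each layout:

```python
import numpy as np

# Toy example: 3 sequences with lengths [4, 3, 1], feature dim D=2.
lengths = [4, 3, 1]                    # sorted, longest first
T, B, D = max(lengths), len(lengths), 2
seqs = [np.full((L, D), i + 1.0) for i, L in enumerate(lengths)]

# SEQ_MAJOR_UNPACKED: shape [T, B, D], zero-padded beyond each sequence's length.
seq_major = np.zeros((T, B, D))
for i, s in enumerate(seqs):
    seq_major[:lengths[i], i] = s

# BATCH_MAJOR_UNPACKED: shape [B, T, D], same data with the first two axes swapped.
batch_major = seq_major.transpose(1, 0, 2)

# SEQ_MAJOR_PACKED: padding removed; for each time step t, the frames of all
# sequences still active at t are stored contiguously, sum(lengths) frames in total.
packed = np.concatenate(
    [seq_major[t, :sum(L > t for L in lengths)] for t in range(T)], axis=0)
assert packed.shape == (sum(lengths), D)
```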
CuDNN references: CuDNN developer guide, CuDNN API reference (search for "packed", or "padded").
See for example cudnnSetRNNDataDescriptor. Some quotes:
With the unpacked layout, both sequence major (meaning, time major) and batch major are supported. For backward compatibility, the packed sequence major layout is supported.
This data structure is intended to support the unpacked (padded) layout for input and output of extended RNN inference and training functions. A packed (unpadded) layout is also supported for backward compatibility.
In TensorFlow, since CuDNN now supports the padded layout, they have cleaned up the code and only support the padded layout. I don't see that you can use the packed layout anymore. (Right?) (I'm not sure why this decision was made. Just to have simpler code? Or is this more efficient?)
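At the Keras level, the padded layout looks roughly like this (shapes are made up; whether the CuDNN kernel is actually used internally depends on further conditions): a padded batch plus a mask, no packing involved:

```python
import tensorflow as tf

B, T, D, H = 4, 7, 16, 32                        # assumed sizes, just for illustration
x = tf.random.normal([B, T, D])                  # padded, batch-major
mask = tf.sequence_mask([7, 5, 3, 2], maxlen=T)  # marks the valid (non-padded) frames
y = tf.keras.layers.LSTM(H, return_sequences=True)(x, mask=mask)  # [B, T, H]
```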
PyTorch, on the other hand, only properly supports the packed layout when you have sequences of different lengths (documentation).
Besides computational efficiency, there is also memory efficiency. Obviously the packed tensor is better w.r.t. memory consumption, so that is not really the question.
I mostly wonder about computational efficiency. Is the packed format the most efficient? Or just the same as padded time-major? Is time-major more efficient than batch-major?
(This question is not necessarily about CuDNN, but in general about any naive or optimized implementation in CUDA.)
But obviously, this question also depends on the rest of the neural network. When you mix the LSTM with other modules that require non-packed tensors, you get a lot of packing and unpacking if the LSTM uses the packed format. But consider that you could re-implement all the other modules to work on the packed format as well: then maybe the packed format would be better in every respect?
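As a small sketch of that idea (using PyTorch's packed representation, with made-up shapes): frame-wise modules such as a linear projection can operate directly on the packed data tensor, so not every surrounding module forces an unpack:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

T, B, D, H = 7, 4, 16, 32                                      # assumed sizes
x = torch.randn(T, B, D)
packed = pack_padded_sequence(x, torch.tensor([7, 5, 3, 2]))   # lengths sorted, longest first
proj = torch.nn.Linear(D, H)
y_data = proj(packed.data)   # shape [sum(lengths), H]: per-frame ops never see the padding
```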
(Maybe the answer is that there is no clear answer. But I don't know; maybe there is. Last time I actually measured, the answer was pretty clear, at least for some parts of my question: time-major is in general more efficient than batch-major for RNNs. Maybe the answer is that it depends on the hardware. But this should not be a guess; it should come with real measurements, or even better with a good explanation. To the best of my knowledge, this should be mostly invariant to the hardware; it would be unexpected to me if the answer varied depending on the hardware. I also assume that packed vs. padded probably does not make much of a difference, again regardless of the hardware. But maybe someone really knows.)