TensorFlow recently added support for 3-D convolution, and I'm attempting to train on some video data.
I have a few questions:
My inputs are 16-frame videos with 3 channels per frame, stored as .npy files of shape (128, 171, 48).
1) The docs for tf.nn.max_pool3d() state that the input shape should be [batch, depth, rows, cols, channels]. Is my channels dimension still 3, even though my .npy arrays are 48 channels deep, so to speak?
2) The next question dovetails with the last one: is my depth 48 or 16?
3) (Since I'm here) The batch dimension works the same with 3-D inputs, correct? The videos are processed one at a time, just like any other image.
Just to be clear: in my case, with a batch size of one and the image dims above, my dimensions would be:
[1 (batch), 16 (depth), 171 (rows), 128 (cols), 3 (channels)]
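To show what I mean concretely, here is a minimal NumPy sketch of how I imagine getting from a raw (128, 171, 48) array to that 5-D layout. It assumes the 48 axis is 16 frames × 3 channels with the channels varying fastest; that stacking order is my assumption about how the files were written, not something I've verified:

```python
import numpy as np

# Hypothetical clip: (cols, rows, frames*channels) = (128, 171, 48).
clip = np.zeros((128, 171, 48), dtype=np.float32)

# Assumption: the fused 48 axis is 16 frames x 3 channels, channels fastest.
# If the file was written in a different order, this reshape would scramble frames.
split = clip.reshape(128, 171, 16, 3)      # (cols, rows, depth, channels)

# Reorder to [depth, rows, cols, channels], then prepend a batch axis of 1
# to get the [batch, depth, rows, cols, channels] layout max_pool3d expects.
ndhwc = split.transpose(2, 1, 0, 3)        # (16, 171, 128, 3)
batch = ndhwc[np.newaxis, ...]             # (1, 16, 171, 128, 3)
print(batch.shape)
```

If the frame/channel stacking order were reversed in the file, the reshape would presumably need to be `reshape(128, 171, 3, 16)` followed by a different transpose.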
EDIT: I've confused raw input size with pooling and kernel sizes here. Some general guidance on this 3-D stuff would be helpful: I'm basically stuck on the dimensions for both convolution and pooling, as the original question makes clear.
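To make the confusion concrete, here is how I currently read the parameter shapes, with purely illustrative kernel and pool sizes. My (possibly wrong) understanding from the docs: conv3d filters are laid out [filter_depth, filter_height, filter_width, in_channels, out_channels], max_pool3d ksize/strides have one entry per input dimension with the batch and channel entries fixed at 1, and 'SAME' padding gives an output size of ceil(dim / stride) per pooled dimension:

```python
import math

# Input laid out as [batch, depth, rows, cols, channels].
input_shape = (1, 16, 171, 128, 3)

# Illustrative conv3d filter: 3x3x3 kernel, 3 in-channels, 64 out-channels.
filter_shape = (3, 3, 3, 3, 64)

# Illustrative max_pool3d window/strides: pool depth, rows, cols by 2;
# batch and channel entries stay 1.
ksize = (1, 2, 2, 2, 1)
strides = (1, 2, 2, 2, 1)

# Expected pooled output shape with 'SAME' padding: ceil(dim / stride)
# on the pooled axes, channels unchanged.
out_shape = (input_shape[0],
             math.ceil(input_shape[1] / strides[1]),
             math.ceil(input_shape[2] / strides[2]),
             math.ceil(input_shape[3] / strides[3]),
             input_shape[4])
print(out_shape)  # (1, 8, 86, 64, 3)
```

If that reading of the filter layout or the padding arithmetic is off, that's exactly the kind of correction I'm looking for.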