TensorFlow recently added support for 3-D convolution, and I'm attempting to train on some video data.
I have a few questions:
My inputs are 16-frame videos with 3 channels per frame, stored as .npy files of shape (128, 171, 48).
1) The docs for tf.nn.max_pool3d() state that the input shape should be [batch, depth, rows, cols, channels]. Is my channels dimension still 3, even though my .npy arrays are 48 channels deep, so to speak?
2) The next question dovetails with the last one: is my depth 48 or 16?
3) (Since I'm here) The batch dimension works the same with 3-D inputs, correct? The videos are processed one at a time, just like any other image.
Just to be clear: in my case, with a batch size of one and the image dims above, my dimensions would be:
[1 (batch), 16 (depth), 171 (rows), 128 (cols), 3 (channels)]
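To show what I mean concretely, here is a minimal NumPy sketch of how I imagine getting from a raw (128, 171, 48) array to that 5-D layout. It assumes the 48 axis is 16 frames × 3 channels with the channels varying fastest; that stacking order is my assumption about how the files were written, not something I've verified:

```python
import numpy as np

# Hypothetical clip: (cols, rows, frames*channels) = (128, 171, 48).
clip = np.zeros((128, 171, 48), dtype=np.float32)

# Assumption: the fused 48 axis is 16 frames x 3 channels, channels fastest.
# If the file was written in a different order, this reshape would scramble frames.
split = clip.reshape(128, 171, 16, 3)      # (cols, rows, depth, channels)

# Reorder to [depth, rows, cols, channels], then prepend a batch axis of 1
# to get the [batch, depth, rows, cols, channels] layout max_pool3d expects.
ndhwc = split.transpose(2, 1, 0, 3)        # (16, 171, 128, 3)
batch = ndhwc[np.newaxis, ...]             # (1, 16, 171, 128, 3)
print(batch.shape)
```

If the frame/channel stacking order were reversed in the file, the reshape would presumably need to be `reshape(128, 171, 3, 16)` followed by a different transpose.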
EDIT: I've confused raw input size with pooling and kernel sizes here. Some general guidance on this 3-D stuff would be helpful: I'm basically stuck on the dimensions for both convolution and pooling, as the original question makes clear.
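To make the confusion concrete, here is how I currently read the parameter shapes, with purely illustrative kernel and pool sizes. My (possibly wrong) understanding from the docs: conv3d filters are laid out [filter_depth, filter_height, filter_width, in_channels, out_channels], max_pool3d ksize/strides have one entry per input dimension with the batch and channel entries fixed at 1, and 'SAME' padding gives an output size of ceil(dim / stride) per pooled dimension:

```python
import math

# Input laid out as [batch, depth, rows, cols, channels].
input_shape = (1, 16, 171, 128, 3)

# Illustrative conv3d filter: 3x3x3 kernel, 3 in-channels, 64 out-channels.
filter_shape = (3, 3, 3, 3, 64)

# Illustrative max_pool3d window/strides: pool depth, rows, cols by 2;
# batch and channel entries stay 1.
ksize = (1, 2, 2, 2, 1)
strides = (1, 2, 2, 2, 1)

# Expected pooled output shape with 'SAME' padding: ceil(dim / stride)
# on the pooled axes, channels unchanged.
out_shape = (input_shape[0],
             math.ceil(input_shape[1] / strides[1]),
             math.ceil(input_shape[2] / strides[2]),
             math.ceil(input_shape[3] / strides[3]),
             input_shape[4])
print(out_shape)  # (1, 8, 86, 64, 3)
```

If that reading of the filter layout or the padding arithmetic is off, that's exactly the kind of correction I'm looking for.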