I have a dataset in which each datapoint consists of 5 two-dimensional videos, stored as a NumPy array with shape (48, 128, 42, 5), i.e. (height, width, frames, video index). The multiple videos essentially serve as "slices" that provide some information about depth, although imperfect.

I want to create a CNN using Keras/TensorFlow for regression, but Keras only has built-in convolutional layers for up to 3 dimensions. Is there a good way to perform convolution and max-pooling on 4-dimensional data, or will I need to create my own layer using TensorFlow?

  • In video, we have 2 spatial dimensions and 1 temporal dimension, so Conv3D is not the perfect thing to use. You can use Conv2D with TimeDistributed, or a ConvLSTM, which may perform better. Here's an answer that may help: https://stackoverflow.com/questions/61431708/transfer-learning-for-video-classification/61433610#61433610 – Zabir Al Nazi May 17 '20 at 10:53
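
A minimal sketch of the comment's suggestion, assuming a channels-last layout in which the 42 frames form the time axis and the 5 videos act as channels (the filter counts and LSTM width are illustrative, not from the question):

```python
import tensorflow as tf

# TimeDistributed applies the same Conv2D to every frame independently;
# the LSTM then aggregates the per-frame features over time.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(42, 48, 128, 5)),  # (frames, height, width, videos-as-channels)
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Conv2D(16, (3, 3), activation="relu")
    ),
    tf.keras.layers.TimeDistributed(tf.keras.layers.MaxPooling2D((2, 2))),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Flatten()),
    tf.keras.layers.LSTM(32),   # aggregate across the frame axis
    tf.keras.layers.Dense(1),   # single regression output
])
model.summary()
```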

1 Answer


TL;DR - You still only need Conv3D

Don't let the input shape confuse you: the number of dimensions of a convolution layer refers to the dimensions along which the filter slides, not to the shape of the input.

For example, if you wanted to process audio, you would still use Conv1D, as you only slide the filter over time (1-D), even though the audio signal might have 2 channels and therefore a shape such as (1, 44100, 2) (1 file, 44100 samples, assuming 1 s of audio at 44.1 kHz, and 2 channels: left and right).
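
A quick sketch of that audio case (the filter count and kernel size are arbitrary): note that the time axis shrinks while the channel axis becomes the number of filters.

```python
import numpy as np
import tensorflow as tf

audio = np.random.rand(1, 44100, 2).astype("float32")  # (batch, samples, channels)
conv = tf.keras.layers.Conv1D(filters=8, kernel_size=9, activation="relu")
print(conv(audio).shape)  # (1, 44092, 8): the filter slid along time only
```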

Similarly, for a 28x28 RGB image with shape (1, 28, 28, 3), you would still use Conv2D, as the filter slides vertically and horizontally across the image.
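
And the corresponding image case, again with arbitrary layer parameters:

```python
import numpy as np
import tensorflow as tf

image = np.random.rand(1, 28, 28, 3).astype("float32")  # (batch, height, width, channels)
conv = tf.keras.layers.Conv2D(filters=8, kernel_size=(3, 3), activation="relu")
print(conv(image).shape)  # (1, 26, 26, 8): the filter slid over height and width
```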

Finally, for your video example, you need the convolutional filter to slide through the image (2-D) plus across the different frames. Therefore, you end up using Conv3D (see the sketch below).

arabinelli