I'm working with a TFRecord dataset consisting of multiple grayscale images of cross-sections of a 3D object, so each example ends up with the shape [32, 256, 256]. The dimension of 32 represents the number of cross-sections, and it is significantly smaller than the other two dimensions.

Because of this, I'm wondering whether I could treat the data as 2D data with 32 channels instead of as 3D data with one channel, mainly to reduce the computational resources needed. I'm currently using TensorFlow with TPUs in Google Colab, and using tf.layers.conv2d instead of tf.layers.conv3d would save a lot of memory, partly from needing less padding.
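
For concreteness, here is a minimal sketch of the two treatments I'm comparing (written with tf.keras.layers, which works in both TF1 and TF2; the filter count of 16 is just an illustrative placeholder):

    import tensorflow as tf

    # Option A: treat each example as a 3D volume with a single channel.
    vol_in = tf.keras.Input(shape=(32, 256, 256, 1))
    vol_out = tf.keras.layers.Conv3D(16, kernel_size=3, padding="same")(vol_in)
    print(vol_out.shape)  # (None, 32, 256, 256, 16)

    # Option B: treat each example as a 2D image with 32 channels.
    img_in = tf.keras.Input(shape=(256, 256, 32))
    img_out = tf.keras.layers.Conv2D(16, kernel_size=3, padding="same")(img_in)
    print(img_out.shape)  # (None, 256, 256, 16)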

Is there any significant difference between the two methods, or is there any convention I should probably follow? Would using conv2d harm my accuracy in any way?

Justin Zhang
  • Think this totally depends on your problem. You say you have multiple grayscale images. Can the 3D object be considered as one entity, or are the channels completely independent? For example, 3D convolution is used for LIDAR data, as that is inherently 3D. Hope this helps. – thushv89 Nov 17 '19 at 04:31

1 Answer

One of the main benefits of convolutional layers over fully connected layers is that the weights are local to a 2D area and shared over all 2D positions, i.e. a filter. This means that a discriminatory pattern in the image is learned once, even if it occurs multiple times or in different positions. That is, the layer is somewhat invariant to translation.

For a 3D signal, you need to work out whether the filter output should be invariant to depth, that is, whether the discriminatory features could occur at any depth (or at more than one depth) in the image, or whether the depth position of features is relatively fixed. The former needs 3D convolutions; with the latter you could get away with 2D convolutions with many channels.
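
To make the weight-sharing difference concrete, here is a rough parameter-count comparison (a sketch under my own assumptions: tf.keras.layers, 3x3(x3) kernels, 16 filters):

    import tensorflow as tf

    # 3D convolution over a [32, 256, 256, 1] volume: each 3x3x3 filter
    # also slides over depth, so its weights are reused at every z position.
    conv3d = tf.keras.layers.Conv3D(16, kernel_size=3, padding="same")
    conv3d.build((None, 32, 256, 256, 1))
    print(conv3d.count_params())  # 3*3*3*1*16 + 16 = 448

    # 2D convolution over a [256, 256, 32] image: each 3x3 filter has a
    # separate weight slice per channel, tying features to fixed depths.
    conv2d = tf.keras.layers.Conv2D(16, kernel_size=3, padding="same")
    conv2d.build((None, 256, 256, 32))
    print(conv2d.count_params())  # 3*3*32*16 + 16 = 4624

The 2D layer has roughly ten times the parameters here, and none of them are shared across the 32 cross-sections.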

For example (making this up, I haven't worked on this), say you had a 3D scan of someone's lungs and you are trying to classify whether there is a tumour. For this you would need 3D convolution, because the combination of filters that represents "tumour" needs to be invariant to the X, Y and Z positions of that tumour. If you used a 2D convolution instead, the training set would have to contain examples of the tumour at every Z position; otherwise the network would be very sensitive to the Z position.

BTW: a CNN combined with an LSTM is another approach to 3D data (sketched below).
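
A minimal sketch of that idea, treating the 32 cross-sections as a sequence of 2D frames (ConvLSTM2D is one off-the-shelf Keras layer for this; the filter count is again an illustrative assumption):

    import tensorflow as tf

    # Treat the z axis as "time": a sequence of 32 grayscale 256x256 frames.
    seq_in = tf.keras.Input(shape=(32, 256, 256, 1))
    x = tf.keras.layers.ConvLSTM2D(16, kernel_size=3, padding="same",
                                   return_sequences=False)(seq_in)
    print(x.shape)  # (None, 256, 256, 16): one feature map summarizing all slices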

geometrikal
  • I'm analyzing 3D models of cells, but I centered all of them in the z dimension (the dimension of size 32). However, the centering may not be perfect, and there might be features along the z dimension that need to be invariant to z position, so I'm going to keep using 3D convolutions for the potential gain in accuracy and generalizability. Thanks for the insight! – Justin Zhang Nov 17 '19 at 16:49
  • Oh nice. This might help then: https://eraldoribeiro.github.io/project/pollen/ . It is for pollen grains but should be similar. – geometrikal Nov 17 '19 at 20:02