https://pytorch.org/docs/stable/generated/torch.nn.Conv3d.html#conv3d describes the input to Conv3d as a tensor of shape (N, Cin, D, H, W). Imagine I have a sequence of images that I want to pass to a 3D CNN. Am I right that the dimensions map as follows (a small sketch follows the list)?
- N -> number of sequences (mini-batch size)
- Cin -> number of channels (3 for RGB)
- D -> number of images in a sequence
- H -> height of one image in the sequence
- W -> width of one image in the sequence
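If that reading is correct, the forward pass should look roughly like this minimal sketch (the channel count and kernel size here are made up for illustration):

```python
import torch
import torch.nn as nn

# A batch of 2 sequences, each with 5 RGB frames of 396x247 pixels:
# shape (N, C_in, D, H, W)
x = torch.randn(2, 3, 5, 396, 247)

# Conv3d convolves over D, H and W; channels stay in dim 1
conv = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
out = conv(x)
print(out.shape)  # torch.Size([2, 8, 5, 396, 247])
```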
The reason I am asking is that when I stack image tensors with a = torch.stack([img1, img2, img3, img4, img5]), a has shape torch.Size([5, 3, 396, 247]), i.e. (D, C, H, W). So is it compulsory to permute my tensor (not reshape, which would scramble the pixel data) to torch.Size([3, 5, 396, 247]), so that the channel dimension comes first, or does the order not matter inside the DataLoader?
Note that the DataLoader will add one more dimension automatically, which corresponds to N.
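In other words, is something like the following necessary? (A minimal sketch; TensorDataset with a single clip just stands in for my real dataset.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Five RGB frames of 396x247, stacked along a new first dim -> (D, C, H, W)
frames = [torch.randn(3, 396, 247) for _ in range(5)]
a = torch.stack(frames)        # torch.Size([5, 3, 396, 247])

# Swap the depth and channel axes -> (C, D, H, W); permute, not reshape,
# since reshape would scramble the pixel data
clip = a.permute(1, 0, 2, 3)   # torch.Size([3, 5, 396, 247])

# Stand-in dataset with a single clip; the DataLoader prepends N
dataset = TensorDataset(clip.unsqueeze(0))
loader = DataLoader(dataset, batch_size=1)

(batch,) = next(iter(loader))
print(batch.shape)             # torch.Size([1, 3, 5, 396, 247])
```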