Is there an actual minimum input image size for popular computer vision models? (E.g., vgg, resnet, etc.)

Question

According to the documentation on pre-trained computer vision models for transfer learning (e.g., here), input images should come in "mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224".

However, when I run transfer learning experiments on 3-channel images whose height and width are smaller than expected (i.e., smaller than 224), the networks generally run smoothly and often achieve decent performance.

Hence, it seems to me that the "minimum height and width" is more of a convention than a critical parameter. Am I missing something here?

score 3 · Answer 1 · answered Oct 06 '21 at 19:54

There is a limit on your input size, and it is tied to the receptive field of the last convolutional layer of your network. Intuitively, you can observe the spatial dimensions shrinking as you progress through the network. At least this is the case for feature-extractor CNNs, which aim to turn the input image into a feature embedding: most pre-trained models, such as vanilla VGG and ResNet networks, do not retain spatial dimensionality. If the input to a convolutional layer is smaller than its kernel size (even after padding), then you simply won't be able to perform the operation.
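To make the shrinking concrete, here is a minimal sketch in plain Python that traces the side length of the feature map through the downsampling layers. The layer lists are my own simplified assumptions (kernel, stride, padding per stride-2 stage), not an exact copy of any library's model definition, and `conv_out` and `trace` are hypothetical helper names.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis of a conv or pooling layer."""
    padded = size + 2 * padding
    if padded < kernel:
        # This is the hard failure described above: the kernel does not fit.
        raise ValueError(f"kernel {kernel} does not fit padded input {padded}")
    return (padded - kernel) // stride + 1


# Assumed layouts, listing only the layers that change the spatial size,
# as (name, kernel, stride, padding).
RESNET_LIKE = [
    ("conv1", 7, 2, 3),
    ("maxpool", 3, 2, 1),
    ("layer2", 3, 2, 1),
    ("layer3", 3, 2, 1),
    ("layer4", 3, 2, 1),
]
# VGG-style: the 3x3 convs with padding 1 keep the size, so only the
# five unpadded 2x2 max pools are listed.
VGG_LIKE = [(f"pool{i}", 2, 2, 0) for i in range(1, 6)]


def trace(size, layers):
    """Return the side length before and after every listed layer."""
    sizes = [size]
    for _name, kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


print(trace(224, RESNET_LIKE))  # the documented default input size
print(trace(32, VGG_LIKE))      # smallest size that survives five 2x2 pools
try:
    trace(16, VGG_LIKE)
except ValueError as err:
    print("16 px fails:", err)
```

Under these assumptions the padded ResNet-style stack never hits the kernel-size error, it just collapses to a 1x1 map, while the unpadded VGG-style pools fail outright once the map is smaller than 2x2.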

MykytaHordia · Answer 2 · 2022-12-25T19:26:04.890

TLDR: adaptive pooling layer

For example, a standard resnet50 with a fixed-size pooling head only accepts inputs of roughly 193 to 224 pixels per side, and this is due to the architecture and its downscaling layers. The only reason the default PyTorch model works on other sizes is that it uses an adaptive pooling layer, which removes the restriction on input size. So it's going to work, but you should be ready for a drop in performance and other fun things :)
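As a rough check of that range, here is a plain-Python sketch. It assumes a ResNet-style stack with five stride-2 stages and a hypothetical fixed head that needs exactly a 7x7 final feature map (a 7x7 average pool followed by a fully connected layer). The helper names are made up for illustration, and `global_avg_pool` only mimics, for a single channel, what an adaptive pooling layer such as `torch.nn.AdaptiveAvgPool2d((1, 1))` does.

```python
def halve(size):
    """One stride-2 stage: kernel 7/pad 3 and kernel 3/pad 1 both give ceil(size / 2)."""
    return (size - 1) // 2 + 1


def final_map(size, n_stages=5):
    """Side length of the last feature map after the assumed stride-2 stages."""
    for _ in range(n_stages):
        size = halve(size)
    return size


def fixed_head_ok(size, pool_kernel=7):
    """Assumed fixed head: works only if the final map is exactly pool_kernel wide."""
    return final_map(size) == pool_kernel


def global_avg_pool(fmap):
    """Average a 2-D grid of any size down to a single value."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


accepted = [s for s in range(1, 513) if fixed_head_ok(s)]
print(accepted[0], accepted[-1])  # bounds of the sizes the fixed head accepts

# The global average does not care how large the grid is.
print(global_avg_pool([[1.0, 3.0], [5.0, 7.0]]))
print(global_avg_pool([[2.0] * 7 for _ in range(7)]))
```

Under these assumptions the fixed head accepts side lengths 193 through 224, which can be compared with the range quoted above. The averaging step works for any map of at least 1x1, which is why the model runs on smaller inputs even though the features it averages are much coarser.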

Hope you find it useful.
