I am trying to train a model that classifies images. The problem is that the images have different sizes. How should I format my images, or adapt my model architecture?
- Please show what you've tried so far and what appears to be not working for you. – Keith John Hutchison Jan 28 '17 at 08:05
- And bam, there goes the code of Inception v4. I disagree with that off-the-shelf comment. A bit more input would be nice - like what kind of net we're talking about - but the downvotes are not justified at all. That _is_ a real problem there. – sunside Jan 28 '17 at 23:40
- The question is how does ImageNet format their image data to be useful for training? – mskw Oct 17 '17 at 14:15
2 Answers
You didn't say what architecture you're talking about. Since you said you want to classify images, I'm assuming it's a partly convolutional, partly fully connected network like AlexNet, GoogLeNet, etc. In general, the answer to your question depends on the network type you are working with.
If, for example, your network only contains convolutional units - that is to say, does not contain fully connected layers - it can be invariant to the input image's size. Such a network could process the input images and in turn return another image ("convolutional all the way"); you would have to make sure that the output matches what you expect, since you have to determine the loss in some way, of course.
If you are using fully connected units though, you're in for trouble: Here you have a fixed number of learned weights your network has to work with, so varying inputs would require a varying number of weights - and that's not possible.
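To make the distinction concrete, here is a minimal Keras sketch, assuming TensorFlow 2.x (the layer counts and sizes are arbitrary): global average pooling collapses the spatial grid to a fixed-length vector, so the classifier head no longer depends on the input size, whereas a Flatten + Dense head pins the input shape.

```python
import tensorflow as tf

# Size-agnostic: convolutions plus global average pooling, so the spatial
# dimensions of the input can stay unspecified (None).
size_agnostic = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(None, None, 3)),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
print(size_agnostic(tf.random.uniform((1, 224, 224, 3))).shape)  # (1, 10)
print(size_agnostic(tf.random.uniform((1, 317, 161, 3))).shape)  # (1, 10)

# Fixed-size: Flatten followed by Dense ties the number of weights to the
# input resolution, so the input shape has to be declared up front.
fixed_size = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```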
If that is your problem, here are some things you can do:
- Don't care about squashing the images. A network might learn to make sense of the content anyway; does scale and perspective mean anything to the content anyway?
- Center-crop the images to a specific size. If you fear you're losing data, do multiple crops and use these to augment your input data, so that the original image is split into N different images of the correct size.
- Pad the images with a solid color to a square size, then resize.
- Do a combination of these (a short tf.image sketch of these options follows below).
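Under the assumption of TF 2.x, here is a minimal sketch of what those options look like with the tf.image API (the 224-pixel target and the helper names are placeholders):

```python
import tensorflow as tf

TARGET = 224  # assumed target side length; use whatever your network expects

def squash(image):
    # Option 1: plain resize; the aspect ratio is not preserved ("squashing").
    return tf.image.resize(image, [TARGET, TARGET])

def crop_augment(image, n=5):
    # Option 2: take several random crops of the target size and treat each
    # crop as a separate training example (the image must be at least
    # TARGET x TARGET pixels for this to work).
    return [tf.image.random_crop(image, [TARGET, TARGET, 3]) for _ in range(n)]

def letterbox(image):
    # Option 3: resize while preserving the aspect ratio, padding the rest
    # with zeros (black) so the output is exactly TARGET x TARGET.
    return tf.image.resize_with_pad(image, TARGET, TARGET)
```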
The padding option might introduce an additional error source to the network's prediction, as the network might (read: likely will) be biased to images that contain such a padded border.
If you need some ideas, have a look at the Images section of the TensorFlow documentation; there are pieces like resize_image_with_crop_or_pad that take care of the heavy lifting for you.
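For example (shown with the current TF 2.x name, tf.image.resize_with_crop_or_pad; the TF 1.x function referenced above behaves the same way):

```python
import tensorflow as tf

image = tf.random.uniform((317, 161, 3))  # stand-in for a decoded image
# Center-crops dimensions that are too large and zero-pads dimensions that are
# too small, so the result is exactly 224x224 without distorting the content.
fixed = tf.image.resize_with_crop_or_pad(image, 224, 224)
print(fixed.shape)  # (224, 224, 3)
```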
As for just not caring about squashing, here's a piece of the preprocessing pipeline of the famous Inception network:
```python
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
    distorted_image,
    lambda x, method: tf.image.resize_images(x, [height, width], method=method),
    num_cases=num_resize_cases)
```
They're totally aware of it and do it anyway.
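For reference, roughly what that selection amounts to in current TensorFlow - a sketch only, not the actual apply_with_random_selector helper from the Inception code:

```python
import random
import tensorflow as tf

# Pick one of several interpolation methods at random per image so the network
# does not latch onto the resize artifacts of a single method.
RESIZE_METHODS = ["bilinear", "nearest", "bicubic", "area"]

def randomly_resized(image, height=299, width=299):
    method = random.choice(RESIZE_METHODS)
    return tf.image.resize(image, [height, width], method=method)
```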
Depending on how far you want or need to go, there is actually a paper called Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition that handles inputs of arbitrary sizes by processing them in a very special way.

- This topic seems far more complicated when you're dealing with object detection and instance segmentation, because anchor box sizes, which are also hyperparameters, need to be adjusted if you have a dataset with high variance in image sizes. – CMCDragonkai Mar 15 '18 at 04:28
- Aspect ratios play a pretty important role for a network that is to distinguish between circles and ellipses. – HelloGoodbye Apr 30 '18 at 02:19
- That is true, but you can provide the original aspect ratio itself as an input to the network and/or make use of learned transformations (e.g. Spatial Transformer Networks). During training, padded batching might work, but it is essentially the same as aspect-correct resizing into a bigger frame. – sunside Apr 30 '18 at 07:44
- Another general observation is that batches do not necessarily have to have the same dimensions; the first batch could deal with 4:3 images, the second with 16:9, etc., as long as the dense layers are taken care of. – sunside Apr 30 '18 at 07:52
- Has anyone trained a classification model using an ROI pooling layer (https://deepsense.ai/region-of-interest-pooling-explained/)? – Jonny Vu Jan 25 '19 at 07:50
- @CMCDragonkai: But don't object detection algorithms such as **Yolo** already resize the images to `(416, 416)`, i.e. multiples of 32 (in the case of `Yolo`)? I believe anchor sizes depend on the objects in the images, i.e. the sizes of the bounding boxes; resizing will change them, but during training they remain constant. As per **Yolo**, the 3 anchor sizes are decided using **K-means** [here](https://github.com/qqwweee/keras-yolo3/blob/master/kmeans.py). – aspiring1 Dec 03 '19 at 06:11
- What do you mean by _Don't care about squashing the images. A network might learn to make sense of the content anyway; does scale and perspective mean anything to the content anyway?_ I have some license plate data where the sizes are also very diverse, from 100x20 to 1100x200. I put the images in the center of a square, but I am wondering whether this is necessary. I understand your sentence about squashing to mean that I can resize all images to e.g. 256x256 and the distortion has no effect? In my case, I don't see why the image deformation should have any effect. – Tobitor Apr 28 '20 at 20:32
- @Tobitor, always make the inputs of the network as close to the actual (test, or inference-time) data as you can. If all your images are much wider than they are high, you should also model your network to process your images like this. That said, if you cannot possibly say what your "usage" data will look like, you have to make some sacrifices during training. And in that case, resizing an image from 1000x200 to 256x256 is generally okay (imagine looking at that license plate at a 60-degree angle - it's very roughly square now). – sunside Apr 29 '20 at 17:10
- Ok, thanks a lot! :-) Another possibility would certainly be to make the images wider than high, for example 100x20 - or is it in general better to have squares? Such a change would also be advantageous because the network would have less data to learn from, so computation would be faster than with images of e.g. 1200x600 or so. Furthermore, such images would be processed better by the network than images with a huge black frame, I think. What do you think about that? – Tobitor Apr 30 '20 at 09:33
- @Tobitor There is no requirement at all for images to be square; it just happens to be the least bad tradeoff if you don't know the actual image sizes during inference. :^) As for size, the smaller the better, but the images need to be big enough to still capture the finest required details - generally speaking, just keep in mind that if you, as a human expert, cannot possibly determine what's in the image, the network won't be able to either. – sunside May 25 '20 at 12:23
Try adding a spatial pyramid pooling layer after your last convolutional layer, so that the FC layers always receive a constant-dimensional vector as input. During training, train on the entire dataset at one particular image size for an epoch, then switch to a different image size for the next epoch and continue training.
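A minimal sketch of such a layer in Keras, assuming TensorFlow 2.x; the pyramid levels are arbitrary, and an area-interpolation resize is used as a simple stand-in for the adaptive max pooling described in the SPP paper:

```python
import tensorflow as tf

class SpatialPyramidPooling(tf.keras.layers.Layer):
    """Pools the feature map into fixed grids (e.g. 4x4, 2x2, 1x1) regardless
    of its spatial size, so the following Dense layers always receive a vector
    of length sum(n * n for n in levels) * channels."""

    def __init__(self, levels=(4, 2, 1), **kwargs):
        super().__init__(**kwargs)
        self.levels = levels

    def call(self, features):
        channels = features.shape[-1]  # channel count must be statically known
        pooled = []
        for n in self.levels:
            # Area-interpolation resize to an n x n grid averages the feature
            # map over n x n bins, approximating adaptive average pooling.
            grid = tf.image.resize(features, [n, n], method="area")
            pooled.append(tf.reshape(grid, [-1, n * n * channels]))
        return tf.concat(pooled, axis=-1)

# Usage: the convolutional trunk accepts any input size, SPP emits a
# fixed-length vector, and the classifier head stays a plain Dense layer.
inputs = tf.keras.Input(shape=(None, None, 3))
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
x = SpatialPyramidPooling()(x)  # (16 + 4 + 1) * 64 = 1344 features
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```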

- Could you elaborate a bit on what "spatial pyramid pooling" is compared to regular pooling? – Matthieu Jun 16 '19 at 21:26
- Please read "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition" at https://blog.acolyer.org/2017/03/21/convolution-neural-nets-part-2/ @Matthieu – Asif Mohammed Aug 30 '19 at 04:35