
People usually just resize any image into a square when training a CNN (for example, ResNet takes a 224x224 square image), but that looks ugly to me, especially when the aspect ratio is far from 1.

(In fact, that might change the ground truth: the label that an expert would give the distorted image could differ from the one they would give the original.)

So now I resize the image to, say, 224x160, keeping the original aspect ratio, and then I pad the image with 0s (by pasting it into a random location in a totally black 224x224 image).
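In code, this preprocessing looks roughly like the following (a minimal Pillow sketch; the function name and the 224 default are my own choices):

```python
import random
from PIL import Image

def resize_and_pad(img: Image.Image, target: int = 224) -> Image.Image:
    """Resize so the longer side equals `target` (keeping the aspect ratio),
    then paste the result at a random offset onto an all-black square canvas."""
    w, h = img.size
    scale = target / max(w, h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = img.resize((new_w, new_h), Image.BILINEAR)

    canvas = Image.new("RGB", (target, target))   # defaults to all zeros (black)
    x = random.randint(0, target - new_w)         # random horizontal offset
    y = random.randint(0, target - new_h)         # random vertical offset
    canvas.paste(resized, (x, y))
    return canvas
```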

My approach doesn't seem original to me, and yet I cannot find any information whatsoever comparing it with the "usual" approach. Funky!

So, which approach is better? Why? (If the answer is data-dependent, please share your thoughts on when one is preferable to the other.)

aerin
Yoni Keren
  • I have exactly the same concern. Strange that no one has answered after 1 month. Have you tried to post the question on the artificial intelligence Stack Exchange site? https://ai.stackexchange.com/ – Mario Stefanutti Jan 25 '18 at 22:21
  • I too have exactly the same concern. In my case though, by changing the aspect ratio, all my images would get distorted more or less the same. I use synthetic, concatenated NIST digits. In my case, I think it does not make much difference to classifying numbers. The only difference I can think of is that by resizing I could apply larger strides to convolutional layers without losing as much information as when padding the images instead. Thus, at the intersection from a convolutional to a fully connected layer, I would require fewer weights. – Sebastian Mar 22 '18 at 09:22
  • This answer helped me out in the end. https://stackoverflow.com/questions/41907598/how-to-train-images-when-they-have-different-size – Sebastian Mar 22 '18 at 09:27

2 Answers


According to Jeremy Howard, padding a big piece of the image (here, a 64x160-pixel strip) has the following effect: the CNN has to learn that the black part of the image is not relevant and does not help distinguish between the classes (in a classification setting), since there is no correlation between the pixels in the black part and membership in a given class. As you are not hard-coding this, the CNN has to learn it by gradient descent, and that will probably take some extra epochs. For this reason, you can do it if you have lots of images and computational power, but if you are on a budget for either, resizing should work better.

David Masip
  • Sounds right so I've voted you up, but: let's say that you normalize all the pixels to [0,1], so the black pixels are all 0s. Then during convolution any kernel will output 0 for those pixels. So... is it super easy to learn, and kind of automatic as well, right? I guess I can experiment with that since I have plenty of data, but still. – Yoni Keren Apr 18 '18 at 07:53
  • If all the pictures have the black padding below them, it is really easy. However, if some of them have it and others don't, and the ones that don't have it contain very relevant information in that region, I am not so sure about it. – David Masip Apr 18 '18 at 08:08
  • Not if you have to construct a classifier that distinguishes between circles and ellipses. – HelloGoodbye Apr 29 '18 at 22:47
  • One possibility is to add an extra colour channel which masks all the pixels that are padding or all the pixels that were part of the original image. That way, the network won't have to "figure out" whether something that is black was part of the original image, hence probably saving the network some work. – HelloGoodbye Apr 29 '18 at 22:51
  • To me that does not necessarily sound like a hard thing for a CNN to learn. In any case it's not clear that it's harder than learning highly squished visual features. If there are any empirical comparisons of the two approaches, I think that would be most useful. – Denziloe Jul 07 '19 at 00:03
  • @HelloGoodbye that's a really nice approach – Harshit Jindal Mar 08 '20 at 16:14

Sorry, this is late, but this answer is for anyone facing the same issue.

First, if scaling that changes the aspect ratio would distort some important features, then you have to use zero-padding.

Zero-padding doesn't make the network take longer to learn because of the large black area itself, but because of the many possible locations the unpadded image can occupy inside the padded image, since you can pad an image in many ways.

For regions consisting only of zero pixels, the output of the convolution operation (before any bias) is zero, and the same holds for max or average pooling. You can also show that a weight receives no update from backpropagation when the input associated with it is zero, for activation functions such as ReLU. So the large black area does not, by itself, drive any weight updates.
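As a quick check of the zero-output claim (a minimal PyTorch sketch of my own, assuming a bias-free convolution; it is not part of the original answer):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A single conv layer with no bias: a 3x3 window of pure zeros maps to exactly 0.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)

x = torch.zeros(1, 1, 8, 8)              # 8x8 "image" whose bottom half is black padding
x[:, :, :4, :] = torch.rand(1, 1, 4, 8)  # top half holds real content

with torch.no_grad():
    y = conv(x)                          # output is 1x1x6x6 (no spatial padding)

print(y[0, 0, -2:, :])                   # rows whose receptive fields see only zeros are all 0
```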

However, the position of the unpadded image inside the padded image does affect training. This is due not to the convolution or pooling layers but to the final fully connected layer(s). For example, if the unpadded image sits on the left side of the padded image and the flattened output of the last convolution or pooling layer is [1, 0, 0], while the same unpadded image placed on the right side yields [0, 0, 1], then the fully connected layer(s) must learn that [1, 0, 0] and [0, 0, 1] mean the same thing for the classification problem.
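To make that concrete, here is a toy sketch (my own illustration, not from the answer) of why a fully connected layer is not position-invariant out of the box:

```python
import numpy as np

# The "same" content flattened at two different positions.
left  = np.array([1.0, 0.0, 0.0])
right = np.array([0.0, 0.0, 1.0])

# A fully connected layer is just a dot product with a weight vector, so the two
# placements give different outputs until the layer learns to map both to the
# same class score.
w = np.random.default_rng(0).normal(size=3)
print(w @ left, w @ right)   # generally two different values
```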

Therefore, learning invariance to the different possible positions of the image is what makes training take longer. If you have 1,000,000 images, then after resizing you still have the same number of images; on the other hand, if you pad and want to cover the different possible locations (say, 10 random positions per image), you effectively need 10,000,000 images, so training takes roughly 10 times longer.

That said, it depends on your problem and what you want to achieve. Also, testing both methods will not hurt.

  • What if we make sure the unpadded image is always in the middle, with padding applied uniformly around it? Is padding better in this case? – user3303020 Mar 09 '23 at 05:50