
I'm using a style-transfer-based deep learning approach that uses the VGG neural network. It works well with small images (512×512 pixels), but it produces distorted results when the input images are large (size > 1500px). The author of the approach suggested dividing the large input image into portions, performing style transfer on portion 1 and then on portion 2, and finally concatenating the two portions into the final large result image, because VGG was made for small images... The problem with this approach is that the resulting image has inconsistent regions in the areas where the portions were "glued" together. How can I correct these areas? Is there an alternative to this dividing method?

jeanluc

1 Answer


Welcome to SO, jeanluc. Great first question.

When you say VGG, I expect you're referring to VGG-16. This architecture uses fully connected layers at the end, which means you can only use it with images of a certain size. I believe the ImageNet default is 224x224 pixels.

If you want to use VGG-16 without modifications, you MUST use images of this size. However, many people remove the fully connected layers at the end (especially in the context of style transfer) so they can feed in any size they want.
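For illustration, here is a minimal sketch of that idea in tf.keras (an assumption on my part, since you may be using a different framework, but `include_top=False` is exactly "remove the fully connected layers" there):

```python
import numpy as np
import tensorflow as tf

# include_top=False drops the fully connected layers, leaving a fully
# convolutional network that accepts variable input sizes.
model = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                    input_shape=(None, None, 3))

# Random data just to demonstrate the shapes; real inputs should go through
# tf.keras.applications.vgg16.preprocess_input first.
x = np.random.rand(1, 1536, 1536, 3).astype("float32")
features = model.predict(x)
print(features.shape)  # (1, 48, 48, 512): H/32 x W/32 x 512 channels
```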

Any size? Well, you probably want to make sure that the image dimensions are multiples of 32, because VGG-16 comes with 5 MaxPooling operations that halve the dimensions every time.
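To make that concrete, a small hypothetical helper (`pad_to_multiple` is my own name, not a library function) that reflect-pads an image up to the next multiple of 32:

```python
import numpy as np

def pad_to_multiple(img: np.ndarray, multiple: int = 32) -> np.ndarray:
    # Pad height and width up to the next multiple of `multiple`.
    h, w = img.shape[:2]
    pad_h = (multiple - h % multiple) % multiple
    pad_w = (multiple - w % multiple) % multiple
    # Reflect padding keeps the borders looking like plausible image content.
    return np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)), mode="reflect")

padded = pad_to_multiple(np.zeros((1500, 900, 3)))
print(padded.shape)  # (1504, 928, 3)
```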

But just because the network can now digest images of any size doesn't mean the predictions will be meaningful. VGG-16 learned what 1000 different objects look like at a scale of 224px. Feeding in a 1500px image of a cat might not activate the cat-related neurons. Is that a problem?

It depends on your use case. I wouldn't trust VGG-16 to classify these high-resolution images in the context of ImageNet, but that is not what you're after. You want to use a pretrained VGG-16 because it should have learned some abilities that come in handy in the context of style transfer. And this is usually true no matter the size of your input. It's almost always preferable to start with a pretrained model rather than from scratch. You probably want to think about finetuning this model for your task, because A) style transfer is quite different from classification and B) you're using a completely different scale of images.
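A hedged sketch of what that finetuning setup could look like in tf.keras (which blocks to freeze is a judgment call, and the training loop itself depends entirely on your style-transfer loss, so it's not shown):

```python
import tensorflow as tf

vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                  input_shape=(None, None, 3))

# Freeze the early blocks (generic edges/textures transfer well) and leave
# the deeper blocks trainable so they can adapt to the new scale and task.
for layer in vgg.layers:
    layer.trainable = layer.name.startswith(("block4", "block5"))
```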

I've never found this recommended patch-based approach to help, for precisely the same reasons you're experiencing. While CNNs learn to recognize local patterns in an image, they also learn global distributions, which is why this doesn't work nicely. You can always try to merge patches using interpolation techniques, but personally I wouldn't waste time on that.
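If you do want to try it anyway, the usual trick is to stylize overlapping patches and cross-fade the overlap. A purely illustrative sketch (`blend_horizontal` is my own helper, not part of the approach you're using):

```python
import numpy as np

def blend_horizontal(left: np.ndarray, right: np.ndarray, overlap: int) -> np.ndarray:
    # `left` and `right` are stylized patches that share `overlap` columns
    # of the original image. Linearly fade from left to right across the seam.
    alpha = np.linspace(1.0, 0.0, overlap)[None, :, None]
    seam = alpha * left[:, -overlap:] + (1.0 - alpha) * right[:, :overlap]
    return np.concatenate([left[:, :-overlap], seam, right[:, overlap:]], axis=1)
```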

Instead, just feed in the full image like you mentioned, which should work once you've removed the fully connected layers. The scale will be off, but there's little you can do if you really want high-resolution inputs. Finetune VGG-16 so it can adapt to your use case at hand.

In case you don't want to finetune, I don't think there's anything else you can do. Use the transformation/scale the network was trained on or accept less than optimal performance when you change the resolution.

pietz
  • thank you for your great response. Well, the approach uses the "imagenet-vgg-verydeep-19" from "https://www.vlfeat.org/matconvnet/pretrained/". I think there are just files with a .mat extension, and hence I cannot modify them. When you say "remove the fully connected layers", do you mean that I must retrain the VGG network? – jeanluc Sep 10 '20 at 16:10
  • No, you don't have to. Convolutions are local filters that traverse the image; they don't care how large the image is. VGG-16 and VGG-19 end with fully connected layers that need to know the input resolution. Specifically, both end with a 7x7 image with 512 channels. That's why the following FC layer needs 25088 (7x7x512) inputs. If you remove these FC layers, you will not get a 1x1000-dim vector but a 7x7x512 tensor instead (see the quick shape check after these comments). – pietz Sep 10 '20 at 16:34
  • can you explain in more detail how I can resolve this following your proposition? Many thanks – jeanluc Sep 12 '20 at 20:58
  • Is there any example of finetuning VGG-16 for such a use case? – June Wang Dec 01 '21 at 11:52
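A quick shape check for the comment above about the 7x7x512 tensor (a sketch in tf.keras; the arithmetic is the same for the MatConvNet weights):

```python
import numpy as np
import tensorflow as tf

# With the FC layers removed, a 224x224 input yields a 7x7x512 tensor,
# i.e. exactly the 25088 (= 7*7*512) inputs the first FC layer expects.
vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                  input_shape=(224, 224, 3))
out = vgg.predict(np.zeros((1, 224, 224, 3), dtype="float32"))
print(out.shape)               # (1, 7, 7, 512)
print(np.prod(out.shape[1:]))  # 25088
```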