Semantic segmentation for large images

Question

I am working with a small number of large images, each of which can have up to 3072*3072 pixels. To train a semantic segmentation model using FCN or U-Net, I construct a large training set in which each training image is 128*128.

In the prediction stage, I cut a large image into small pieces of 128*128, the same size as the training images, feed these pieces into the trained model, and get the predicted masks. Afterwards, I stitch the small patches together to get the mask for the whole image. Is this the right way to perform semantic segmentation on large images?

Why not train a model directly on the `3072` images? – drpng, Feb 14 '17 at 16:49
There are only around 100 of the large 3072 images. Also, training on the large images seems to be very slow. I tried training the model with 256*256, 128*128 and 64*64 patches. The training time increases very quickly as the patch size increases. – user288609, Feb 14 '17 at 22:14

score 7 · Answer 1 · answered Dec 12 '17 at 11:53

Your solution is often used for this kind of problem. However, I would argue that whether it truly makes sense depends on the data. Let me give you two examples you can still find on Kaggle.

If you wanted to mask certain parts of satellite images, you would probably get away with this approach without a drop in accuracy. These images are highly repetitive and there's likely no correlation between the segmented area and where in the original image it was taken from.
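For a case like this, the cut, predict and stitch procedure from the question can be sketched roughly as follows. This is only a minimal NumPy sketch under my own assumptions, not the asker's code: `predict_tiled` and `dummy_predict` are hypothetical names, and `dummy_predict` merely stands in for whatever per-patch prediction the trained model provides.

```python
import numpy as np

def predict_tiled(image, predict_fn, tile=128):
    """Split an (H, W, C) image into non-overlapping tiles, predict a mask
    for each tile with predict_fn, and stitch the masks back together."""
    h, w = image.shape[:2]
    # Pad the bottom/right edges so both sides are multiples of the tile size.
    pad_h = (-h) % tile
    pad_w = (-w) % tile
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="edge")
    mask = np.zeros(padded.shape[:2], dtype=np.float32)
    for y in range(0, padded.shape[0], tile):
        for x in range(0, padded.shape[1], tile):
            patch = padded[y:y + tile, x:x + tile]
            mask[y:y + tile, x:x + tile] = predict_fn(patch)
    # Crop the padding away again.
    return mask[:h, :w]

def dummy_predict(patch):
    # Hypothetical stand-in for a trained model: thresholds the mean intensity.
    return (patch.mean(axis=-1) > 0.5).astype(np.float32)

image = np.random.default_rng(0).random((300, 260, 3))
mask = predict_tiled(image, dummy_predict, tile=128)
print(mask.shape)
```

With a real network each tile is predicted without any context from its neighbours, so the stitched mask can show seams at the tile borders. The dummy function above is purely per-pixel and therefore hides that effect.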

If you wanted to segment a car from its background, it wouldn't be desirable to break it into patches. Over several layers the network will learn the global distribution of a car in the frame. It's very likely that the mask is positive in the middle and negative in the corners of the image.

Since you didn't give any specifics about what you're trying to solve, I can only give a general recommendation: try to keep the input images as large as your hardware allows. In many situations I would rather downsample the original images than break them down into patches.

Concerning the recommendation of curio1729, I can only advise against training on small patches and testing on the original images. While it's technically possible thanks to fully convolutional networks, you're changing the data to an extent that is very likely to hurt performance. CNNs are known for extracting local features, but a large amount of global information is learned through the abstraction over multiple layers.
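As a rough illustration of the downsampling route, here is a minimal NumPy sketch that shrinks an image by block averaging and brings a predicted mask back to the original resolution with nearest-neighbour repetition. The function names and the factor of 4 are my own assumptions for the example. In practice you would probably use the resize function of an image library instead.

```python
import numpy as np

def downsample_mean(image, factor):
    """Downsample an (H, W, C) image by averaging factor x factor blocks."""
    h, w, c = image.shape
    # Crop so that both sides are divisible by the factor.
    h2, w2 = h - h % factor, w - w % factor
    blocks = image[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor, c)
    return blocks.mean(axis=(1, 3))

def upsample_nearest(mask, factor):
    """Upsample an (H, W) mask by repeating every pixel factor times per axis."""
    return np.repeat(np.repeat(mask, factor, axis=0), factor, axis=1)

image = np.ones((512, 512, 3))
small = downsample_mean(image, 4)            # the network would see this
full_mask = upsample_nearest(small[..., 0], 4)  # mask back at the input size
print(small.shape, full_mask.shape)
```

The trade-off is that the network sees the whole frame at once, but fine detail smaller than the downsampling factor is lost.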

score 2 · Accepted Answer · answered Feb 15 '17 at 05:59


Input image data: I would not advise feeding the big image (3072x3072) directly into Caffe. A batch of small images will fit better into memory, and parallel processing will also come into play. Data augmentation will also be feasible.

Output for the big image: you are better off recasting the input size of the FCN to 3072x3072 during the test phase, because the layers of an FCN can accept inputs of any size. You will then get a 3072x3072 segmented image as output.
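To make the "any input size" point concrete, here is a small NumPy sketch of a single zero-padded convolution. It is not Caffe or Keras code, and `conv2d_same` is a hypothetical helper written for this example. The same 3x3 weights are applied to a 128x128 and a 512x512 input, and the output always matches the input size.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Zero-padded 2D cross-correlation for odd-sized kernels.
    The output has the same height and width as the input."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros(image.shape, dtype=np.float64)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + image.shape[0],
                                           dx:dx + image.shape[1]]
    return out

kernel = np.ones((3, 3)) / 9.0       # the "trained" weights, fixed once
rng = np.random.default_rng(0)
small = rng.random((128, 128))
large = rng.random((512, 512))
print(conv2d_same(small, kernel).shape)  # same weights, small input
print(conv2d_same(large, kernel).shape)  # same weights, large input
```

The weights do not depend on the spatial size, which is why nothing has to be retrained. Note that convolving a 128x128 crop gives the same values as the full image only away from the crop border, where the zero padding replaces real neighbouring pixels.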

answered by curio17

Hi Kishen, thanks for the reply. What do you mean by "recast" in "recast the input size of FCN to 3072x3072 during test phase"? Do I have to change the FCN architecture (specifically, the shape of the first layer) even after the training process has finished? I am using Keras, which generally loads the trained weights in the prediction phase. If the architecture is changed, loading the weights will not work. – user288609 Feb 15 '17 at 15:40
Try this: http://stackoverflow.com/questions/39814777/can-keras-deal-with-input-images-with-different-size – curio17 Feb 15 '17 at 16:14
Thanks for this, which is what I am looking for. By the way, it seems that we can also feed the training process with images of different sizes, is that right? – user288609 Feb 15 '17 at 20:33
It also depends on the parameters of your architecture. You have to ensure that after the convolutional, pooling, upsampling and deconvolutional layers you get a segmented image of the same size as the input. So, for a given architecture, only certain image sizes will produce an output of the same size as the input image. – curio17 Feb 16 '17 at 02:33
Are you suggesting training on 128x128 images and then validating on 3072x3072 images? – pietz Dec 12 '17 at 08:07
It depends in some way on the net being used. If your net can work in FCN mode, then it has the advantage that you can train on small images and run it on bigger images. – curio17 Dec 12 '17 at 08:16
I agree that this is possible on a technical level, but I would not recommend doing this. Your network will not be able to learn global information about the frame. Do you have an example where this worked with good results? I'm quickly going to type an answer myself. – pietz Dec 12 '17 at 08:29
My answer was specific to this question, you can always give a generic answer. – curio17 Dec 12 '17 at 08:31
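
On the point raised in the comments that only certain image sizes give an output of the same size as the input: here is a minimal sketch of the arithmetic, assuming an encoder-decoder with "same"-padded convolutions, `depth` 2x2 pooling steps that round down, and `depth` 2x upsampling steps. Both helper names are hypothetical, and a real architecture may have different constraints.

```python
def output_size(size, depth):
    """Side length after `depth` 2x poolings (rounding down)
    followed by `depth` 2x upsamplings."""
    for _ in range(depth):
        size //= 2
    return size * 2 ** depth

def next_valid_size(size, depth):
    """Smallest side length >= size that is divisible by 2**depth."""
    multiple = 2 ** depth
    return -(-size // multiple) * multiple  # ceiling division

for side in (128, 100, 3072):
    print(side, output_size(side, 4), next_valid_size(side, 4))
```

Under these assumptions a side length survives the round trip only if it is divisible by 2**depth. With four pooling steps, both 128 and 3072 qualify, while 100 would come out as 96 and would have to be padded to 112 first.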
