I'm working on semantic image segmentation with U-Net-based models. The input images have different dimensions (between 300 and 600 pixels along each axis). My approach so far has been to rescale the images to standard dimensions and work from there.
Now I want to try a sliding-window approach: extracting e.g. 64x64 patches from the original images (no rescaling) and training a model on those. I'm not sure how to implement this efficiently.
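For context, this is roughly the kind of tiling I mean; just a minimal sketch using `tf.image.extract_patches` (the helper name and the non-overlapping stride are only for illustration):

```python
import tensorflow as tf

def extract_tiles(image, patch=64, stride=64):
    """Split a single [H, W, C] image into [num_patches, patch, patch, C] tiles."""
    channels = image.shape[-1]
    patches = tf.image.extract_patches(
        images=image[tf.newaxis, ...],      # add a batch dimension
        sizes=[1, patch, patch, 1],
        strides=[1, stride, stride, 1],
        rates=[1, 1, 1, 1],
        padding="VALID")                    # drops any border remainder
    # each patch comes out flattened (rows, cols, channels), so reshape it back
    return tf.reshape(patches, [-1, patch, patch, channels])
```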
For the training phase, I already have an online augmentation object (a Keras Sequence) for random transforms. Should I add a patch extraction step in there? If I do that, I'll be slicing numpy arrays and yielding them, which doesn't sound very efficient. Is there a better way to do this?
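Concretely, the naive version I'd write looks something like this (a rough sketch; the class and attribute names are placeholders, and the random crop would sit alongside my existing augmentation):

```python
import numpy as np
from tensorflow.keras.utils import Sequence

class PatchSequence(Sequence):
    """Yields batches of random 64x64 crops and the matching mask crops."""
    def __init__(self, images, masks, batch_size=32, patch_size=64):
        # images/masks: lists of full-size numpy arrays (variable H and W)
        self.images = images
        self.masks = masks
        self.batch_size = batch_size
        self.patch_size = patch_size

    def __len__(self):
        return int(np.ceil(len(self.images) / self.batch_size))

    def __getitem__(self, idx):
        imgs = self.images[idx * self.batch_size:(idx + 1) * self.batch_size]
        msks = self.masks[idx * self.batch_size:(idx + 1) * self.batch_size]
        x, y = [], []
        for img, mask in zip(imgs, msks):
            # one random top-left corner per image, same crop for image and mask
            i = np.random.randint(0, img.shape[0] - self.patch_size + 1)
            j = np.random.randint(0, img.shape[1] - self.patch_size + 1)
            x.append(img[i:i + self.patch_size, j:j + self.patch_size])
            y.append(mask[i:i + self.patch_size, j:j + self.patch_size])
        return np.stack(x), np.stack(y)
```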
And for the prediction phase, again - should I extract patches from the images in numpy and feed them to the model? If I choose overlapping windows (e.g. patch dims 64x64 with strides 32x32), should I manually (in numpy) weight/average/stitch the raw patch predictions from the model to produce a full-scale segmentation? Or is there a better way to handle this?
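The manual numpy reconstruction I have in mind would be roughly this (a sketch assuming a channels-last softmax output and plain uniform averaging of the overlaps):

```python
import numpy as np

def predict_full_image(model, image, patch=64, stride=32):
    """Tile a [H, W, C] image, predict each patch, average overlapping predictions."""
    h, w = image.shape[:2]
    n_classes = model.output_shape[-1]
    acc = np.zeros((h, w, n_classes), dtype=np.float32)
    counts = np.zeros((h, w, 1), dtype=np.float32)
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            pred = model.predict(image[np.newaxis, i:i + patch, j:j + patch])[0]
            acc[i:i + patch, j:j + patch] += pred
            counts[i:i + patch, j:j + patch] += 1.0
    # note: borders are skipped when (h - patch) or (w - patch) isn't a multiple
    # of the stride; padding the image first would cover them
    return acc / np.maximum(counts, 1.0)
```

Calling `predict` once per patch like this is obviously slow; batching all patches into one call would be better, but it illustrates the averaging I mean.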
I'm using TF 2.1 btw. Any help is appreciated.