
I'm trying to use a simple SSD that was trained on 300x300 data with annotated bounding boxes. If I crop the images manually, it works correctly, but with full-size images it fails (obviously), since resizing large images down to 300x300 destroys many visual features.

I figured the good old sliding window would work here, but I'm having some problems rebuilding the image with detections, and I must admit I'm a bit clueless on how to approach it. What I have so far:

At first, I tried this:

chips = F.unfold(img_t.data, kernel_size=300)

following some examples from Stack Overflow, but this gives me the error Input Error: Only 4D input Tensors are supported (got 3D).
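As far as I can tell, F.unfold expects a batched 4D input, so adding a batch dimension makes the call itself go through (a minimal sketch with a stand-in tensor):

import torch
import torch.nn.functional as F

img_t = torch.rand(3, 900, 900)       # stand-in C x H x W image
chips = F.unfold(img_t.unsqueeze(0),  # add batch dim -> 1 x 3 x 900 x 900
                 kernel_size=300, stride=300)
print(chips.shape)                    # torch.Size([1, 270000, 9])

Each column is a flattened patch, though, not something I can feed to the detector directly.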

So after some more googling, I found something that works:

import numpy as np
import matplotlib.pyplot as plt

patch_w = 300
patch_h = 300
# img_t is a C x H x W tensor; the first unfold keeps the 3 channels together,
# the next two slide a non-overlapping 300x300 window over H and then W
patches = img_t.data.unfold(0, 3, 3).unfold(1, patch_w, patch_h).unfold(2, patch_w, patch_h)

# Visualise a small part:
fig = plt.figure(figsize=(4, 4))
fig.tight_layout()
plt.subplots_adjust(left=0.1, bottom=0.1, right=0.9, top=0.9, wspace=0.01, hspace=0.01)
for i in range(4):
    for j in range(4):
        inp = transp(patches[0][i][j])  # transp: the ToPILImage transform defined below
        inp = np.array(inp)
        ax = fig.add_subplot(4, 4, (i * 4) + j + 1, xticks=[], yticks=[])
        plt.imshow(inp)
plt.show()

I then feed the patches to my detector and it looks more or less OK, but there's no overlap (an object can be cut into pieces and missed), and more importantly, I can't reverse the process with fold without getting drowned in exceptions.

I'm not adamant about using the fold/unfold combination for the task. What I really want is to feed a large image into the network in a way that preserves as much information as possible, mark down the detections, and rebuild the image with bounding boxes from the smaller patches.
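Conceptually, what I'm after is something like this sketch, where detector is a hypothetical stand-in for my SSD that returns (x1, y1, x2, y2) boxes in patch coordinates, and patches has the [B, C, n_rows, n_cols, kernel, kernel] layout of the batched unfold used further down:

# Sketch only: shift per-patch detections back into full-image coordinates.
# Patch (i, j) starts at (left=j*stride, top=i*stride) in the (padded) image.
def boxes_on_full_image(patches, detector, stride):
    all_boxes = []
    n_rows, n_cols = patches.shape[2], patches.shape[3]
    for i in range(n_rows):
        for j in range(n_cols):
            for x1, y1, x2, y2 in detector(patches[0, :, i, j]):
                all_boxes.append((x1 + j * stride, y1 + i * stride,
                                  x2 + j * stride, y2 + i * stride))
    return all_boxes

With overlapping patches the same object can be detected more than once, so the collected boxes would still need something like non-maximum suppression.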

What I came up with is this:

# tiles: the list of annotated 300x300 crops returned by the detector
# dims: (rows, cols) of the patch grid
new_im = Image.new("RGB", (300*dims[1], 300*dims[0]))
idx = 0
for i in range(dims[0]):
    for j in range(dims[1]):
        new_im.paste(tiles[idx], (j*300, i*300))
        idx += 1
new_im.show()

This rebuilds the image, but in a very artificial way: the detector annotates the individual crops and returns them as a list of images, which I then stitch back together here. It's both ugly and inefficient.

After a bit of fiddling I got it to work, but there comes a peculiarity of PyTorch: it adds the overlapping parts of patches together instead of averaging them (see image: mosaic error). How can I fix it? I realise normalisation won't do anything here, since it would normalise the good pixels as well; it needs to just average the overlapping pixels. Also, please note the image was cropped erroneously. Simple code to reproduce:

import torch
import torch.nn.functional as F
from torchvision.transforms import transforms
from PIL import Image

def fold_unfold(img_path):
    transt = transforms.Compose([
        transforms.ToTensor(),
        # transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
    ])

    transp = transforms.Compose([
        # transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
        transforms.ToPILImage()
    ])
    img_t = transt(Image.open(img_path))
    img_t = img_t.unsqueeze(0)
    kernel = 300
    stride = 200
    img_shape = img_t.shape
    B, C, H, W = img_shape
    # number of pixels missing:
    pad_w = W % kernel
    pad_h = H % kernel
    # Padding the **INPUT** image with missing pixels:
    img_t = F.pad(input=img_t, pad=(pad_w//2, pad_w-pad_w//2, pad_h//2, pad_h-pad_h//2), mode='constant', value=0)
    img_shape = img_t.shape
    B, C, H, W = img_shape
    print("\n-----input shape: ", img_shape)

    patches = img_t.unfold(3, kernel, stride).unfold(2, kernel, stride).permute(0,1,2,3,5,4)

    print("\n-----patches shape:", patches.shape)
    # reshape output to match F.fold input
    patches = patches.contiguous().view(B, C, -1, kernel*kernel)
    print("\n", patches.shape) # [B, C, nb_patches_all, kernel_size*kernel_size]
    patches = patches.permute(0, 1, 3, 2) 
    print("\n", patches.shape) # [B, C, kernel_size*kernel_size, nb_patches_all]
    patches = patches.contiguous().view(B, C*kernel*kernel, -1)
    print("\n", patches.shape) # [B, C*prod(kernel_size), L] as expected by Fold
    # https://pytorch.org/docs/stable/nn.html#torch.nn.Fold

    output = F.fold(patches, output_size=(H, W), kernel_size=kernel, stride=stride)
    # mask that mimics the original folding:
    recovery_mask = F.fold(torch.ones_like(patches), output_size=(H,W), kernel_size=kernel, stride=stride)
    output = output/recovery_mask

    print(output.shape) # [B, C, H, W]
    aspil = transp(output[0])
    aspil.show()

Still, the image is cropped quite a lot, so something is still wrong (image: still cropping issues).

Finally, getting the cropping done (I've updated the code above to the working version). The problem comes from the way PyTorch does the unfolding. The tensor unfold method doesn't zero-pad automatically; instead, it simply cuts off the part of the image that doesn't fit into the sliding window. I solved it by zero-padding the tensor before unfolding it.
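The cut-off is easy to see on a toy 1D tensor (a minimal sketch):

import torch

t = torch.arange(10)
print(t.unfold(0, 4, 4))
# tensor([[0, 1, 2, 3],
#         [4, 5, 6, 7]])   <- elements 8 and 9 are silently dropped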


1 Answer


I'm posting this as an answer so it's easier for others to find a solution if they encounter a similar problem.

Key Highlights:

  • PyTorch's tensor unfold crops out the part of the image that doesn't fit into the sliding window. (For example, with a 300x300 image and a 100x100 window nothing gets cropped, but with a 290x290 image and the same window, unfold will, well... crop out the last 90 rows and columns of the original image.) The solution is to zero-pad the image preemptively so its size matches the sliding window.
  • The size of the input image changes after zero-padding (no surprise there), but it's easy to forget about that when reconstructing the original image.
  • Ideally you may want to crop the image back to its original size at the end, but with the sliding-window approach it makes more sense for my application to keep the padding around the image, so that the centre of my detector can be applied to the edges of the image too.
  • Unfolding: I couldn't find a practical difference between patches = img_t.unfold(3, kernel, stride).unfold(2, kernel, stride).permute(0, 1, 2, 3, 5, 4) and patches = img_t.unfold(2, kernel, stride).unfold(3, kernel, stride) (see the sketch after this list), so an explanation of that would be welcome.
  • The image tensor must be reshaped a number of times before it can be folded back into the original (padded!) image.
  • Normalisation, not in the sense of an image transform, but to revert the effect of the sliding-window overlap: another peculiarity of PyTorch is the way it pastes tensors onto one another when folding overlapping patches. Instead of averaging the overlap area, it adds the patches together. This can be reverted with a form of overlap mask, a tensor with the exact shape of the produced patches and a value of 1 at every point. After folding, each pixel's value equals the number of patches stacked on it, which is exactly the denominator needed to average the colours in the overlaps.
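Regarding the unfolding bullet above, here is a quick sketch suggesting the two variants really are equivalent: tensor unfold appends the new window dimension at the end, so unfolding W before H leaves the last two dims as (kernel_w, kernel_h), and the permute just swaps them back into (kernel_h, kernel_w):

import torch

img_t = torch.rand(1, 3, 700, 700)
kernel, stride = 300, 200

# W first, then H: last two dims come out as (kW, kH), hence the permute
a = img_t.unfold(3, kernel, stride).unfold(2, kernel, stride).permute(0, 1, 2, 3, 5, 4)
# H first, then W: last two dims are already (kH, kW)
b = img_t.unfold(2, kernel, stride).unfold(3, kernel, stride)

print(torch.equal(a, b))  # True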

The code that ultimately worked for me:

import torch
from torchvision.transforms import transforms
import torch.nn.functional as F
from PIL import Image

img_path = 'filename.jpg'


def fold_unfold(img_path):
    transt = transforms.Compose([
        transforms.ToTensor(),
        # transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
    ])

    transp = transforms.Compose([
        # transforms.Normalize(mean=[0.5,0.5, 0.5], std=[0.5, 0.5, 0.5]),
        transforms.ToPILImage()
    ])
    img_t = transt(Image.open(img_path))
    img_t = img_t.unsqueeze(0)
    kernel = 300
    stride = 200  # smaller than kernel will lead to overlap
    img_shape = img_t.shape
    B, C, H, W = img_shape  # Batch size, here 1, channels (3), height, width
    # number of pixels missing in each dimension:
    pad_w = W % kernel
    pad_h = H % kernel
    # Padding the **INPUT** image with missing pixels:
    img_t = F.pad(input=img_t, pad=(pad_w//2, pad_w-pad_w//2,
                                    pad_h//2, pad_h-pad_h//2), mode='constant', value=0)
    img_shape = img_t.shape
    # UPDATE the shape information to account for padding
    B, C, H, W = img_shape
    print("\n----- input shape: ", img_shape)

    patches = img_t.unfold(3, kernel, stride).unfold(2, kernel, stride).permute(0, 1, 2, 3, 5, 4)

    print("\n----- patches shape:", patches.shape)
    # reshape output to match F.fold input
    patches = patches.contiguous().view(B, C, -1, kernel*kernel)
    print("\n", patches.shape) # [B, C, nb_patches_all, kernel_size*kernel_size]
    patches = patches.permute(0, 1, 3, 2) 
    print("\n", patches.shape) # [B, C, kernel_size*kernel_size, nb_patches_all]
    patches = patches.contiguous().view(B, C*kernel*kernel, -1)
    print("\n", patches.shape) # [B, C*prod(kernel_size), L] as expected by Fold
    # https://pytorch.org/docs/stable/nn.html#torch.nn.Fold

    output = F.fold(patches, output_size=(H, W),
                    kernel_size=kernel, stride=stride)
    # mask that mimics the original folding:
    recovery_mask = F.fold(torch.ones_like(patches), output_size=(
        H, W), kernel_size=kernel, stride=stride)
    output = output/recovery_mask

    print(output.shape)  # [B, C, H, W]
    aspil = transp(output[0])
    aspil.show()


fold_unfold(img_path)
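If you do want the output cropped back to the original size at the end (see the third highlight above), a minimal sketch, assuming you saved the pre-padding height/width and the pad amounts before calling F.pad:

# Sketch: undo the symmetric padding applied before unfolding.
# H_orig, W_orig, pad_h and pad_w are the values from *before* F.pad.
def crop_to_original(output, H_orig, W_orig, pad_h, pad_w):
    top, left = pad_h // 2, pad_w // 2
    return output[..., top:top + H_orig, left:left + W_orig]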