Recently I came across some code that extracted (sliding-window style) a number of square patches from an RGB image (or a batch of them) of shape N x C x H x W (batch, channels, height, width). They did this as follows:
patch_width = 3
# Tensor.unfold takes (dimension, size, step) positionally
patches = image.permute(0, 2, 3, 1) \
    .unfold(1, patch_width, patch_width) \
    .unfold(2, patch_width, patch_width)
I understand from the documentation that the unfold() method "returns all slices of size size from self tensor in the dimension dim," but try as I might, I just can't build a good intuition for why stacking two .unfold() calls produces square patches. I get what happens when you use unfold() once on a tensor. I don't get what happens when you call it twice in succession along two different dimensions.
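To make the question concrete, here's a toy experiment I ran on a made-up 4x4 tensor (batch and channel dims dropped for simplicity), which shows the shapes but not, to me, the *why*:

```python
import torch

# Toy 4x4 "image" (no batch or channel dims) for illustration
x = torch.arange(16).reshape(4, 4)

# One unfold along dim 0: rows are grouped into windows of 2
rows = x.unfold(0, 2, 2)
print(rows.shape)  # torch.Size([2, 4, 2])

# A second unfold along dim 1 groups columns the same way,
# and now each trailing 2x2 block is a square patch
patches = x.unfold(0, 2, 2).unfold(1, 2, 2)
print(patches.shape)      # torch.Size([2, 2, 2, 2])
print(patches[0, 0])      # top-left patch: [[0, 1], [4, 5]]
```

So the shapes come out right, and patches[0, 0] really is the top-left 2x2 block of x, but I can't see the mechanism that makes the two trailing dims line up into a square patch.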
I've seen this approach used multiple times, always without a good explanation of why it works (1, 2), and it's driving me bonkers. Why are the spatial dimensions H and W permuted to dims 1 and 2, while the channel dim is moved to 3? And why does unfolding the same way on dim 1, then on dim 2, result in square patch_width by patch_width patches?
Any insight would be hugely appreciated, even if it's just a link to an article I missed. I've been Googling for well over an hour now and have had very little success. Thank you!