
I was working on a guided project related to image processing. The instructor used the unsqueeze(0) function to set up the batch size. I would like to know what happens after the batch dimension is added. The code is given below for your reference.

I will be very thankful for a quick response.

from PIL import Image
from torchvision import transforms as T

def preprocess(img_path, max_size=500):
  image = Image.open(img_path).convert('RGB')

  # Cap the image at max_size pixels on its longer side
  if max(image.size) > max_size:
    size = max_size
  else:
    size = max(image.size)

  img_transform = T.Compose([
      T.Resize(size),
      T.ToTensor(),
      T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
  ])

  image = img_transform(image)
  image = image.unsqueeze(0)  # [C, H, W] -> [1, C, H, W]
  return image

2 Answers


unsqueeze is likely used here because you are working with a convolutional neural network.

When you load an image, it will typically have 3 dimensions: width, height, and number of color channels. For black-and-white images the number of color channels is 1; for colored images there are 3 (red, green, and blue, i.e. RGB). So, in your case, when you load the image and store it as a tensor, it has shape:

image = img_transform(image) # the resulting image has shape [3, H, W]

Note, the reason that the order of dimensions is [channels, height, width] and not some other order is that this is PyTorch's convention. Other libraries/software may do it differently.

However, 3 dimensions is not enough for a 2D convolutional neural network. In deep learning, data is processed in batches. So, in the case of a convolutional neural network, instead of processing just one image at a time, it will process N images at the same time in parallel. We call this collection of images a batch. So instead of dimensions [C, H, W], you'll have [N, C, H, W]. For example, for a batch of 64 colored 100 by 100 images, you would have the shape:

[64, 3, 100, 100]

Now, if you want to only process one image at a time, you still need to put it into batch form for a model to accept it. For example, if you have an image of shape [3, 100, 100] you'd need to convert it to [1, 3, 100, 100]. This is what unsqueeze(0) does:

image = img_transform(image) # [3, H, W]
image = image.unsqueeze(0) # [1, 3, H, W]
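A minimal runnable sketch of that shape change (the 100 by 100 size is arbitrary, chosen for illustration):

```python
import torch

image = torch.randn(3, 100, 100)   # one [C, H, W] image
batch = image.unsqueeze(0)         # insert a new dimension at position 0

print(image.shape)  # torch.Size([3, 100, 100])
print(batch.shape)  # torch.Size([1, 3, 100, 100])

# squeeze(0) reverses it, removing the size-1 batch dimension
assert batch.squeeze(0).shape == image.shape
```

unsqueeze does not copy the data; it returns a view of the same tensor with an extra size-1 dimension.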
Jay Mody

After this line:

image = Image.open(img_path).convert('RGB')

image is, possibly, a 3D matrix of some sort. One way that information might be laid out is with dimensions [Channel, Row, Column], so you have:

  • an R matrix containing many rows each of which contains the intensity values of the Red channel
  • a G matrix containing many rows each of which contains the intensity values of the Green channel
  • a B matrix containing many rows each of which contains the intensity values of the Blue channel

Now, in machine learning, when we are training a model we are very rarely interested in having only one example. We train on batches of examples. A batch is simply a set of images stacked on top of each other, so we need to go from [Channel, Row, Column] to [Batch, Channel, Row, Column].

This is what unsqueeze(0) does: it adds a new, zeroth dimension that makes the images stackable.
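To make the stacking concrete, here is a small sketch (sizes chosen arbitrarily) showing that once each image has a batch dimension, they can be concatenated along it into a single batch:

```python
import torch

# Three single images, each given a batch dimension via unsqueeze(0)
images = [torch.randn(3, 100, 100).unsqueeze(0) for _ in range(3)]

# Concatenate along the batch (zeroth) dimension to form one batch
batch = torch.cat(images, dim=0)
print(batch.shape)  # torch.Size([3, 3, 100, 100])
```

Without the added zeroth dimension, torch.cat along dim=0 would merge the images along the channel axis instead of producing a batch.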

Richard
    PyTorch uses the [N, C, H, W] convention for dimensions for images, where N = batch size, C = color channels, H = height, and W = width. You can technically give it anything you want, but usually that's the convention for image processing. – Jay Mody Aug 17 '21 at 22:39
    Thank you Richard for helping me. Thanks a lot – Rachit S Garg Aug 18 '21 at 09:01