
I am working on custom object detection with YOLOv5. We can provide different input image sizes to the network. How can a DNN accept inputs of different sizes? Does YOLO have different backbones for different input sizes?

When I give the argument --imgsz as 640, the YOLO dataloader resizes the images to (384, 672, 3), and if --imgsz is 320, the resized images are of size (224, 352, 3). As conventional CNNs accept fixed, square (equal height and width) inputs, how is YOLO handling the variable image sizes?
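For context, the reported shapes are consistent with the dataloader padding both spatial dimensions up to multiples of the network's maximum stride, which is 32 in YOLOv5. A quick check (the stride value of 32 is an assumption based on YOLOv5's architecture, not taken from the question):

```python
# Check that the dataloader's output shapes (H, W) from the question
# are all multiples of YOLOv5's maximum stride (assumed to be 32).
STRIDE = 32
reported_shapes = [(384, 672), (224, 352)]
for h, w in reported_shapes:
    assert h % STRIDE == 0 and w % STRIDE == 0
print("all reported shapes are multiples of", STRIDE)  # -> all reported shapes are multiples of 32
```

This is why the network can still run its strided downsampling layers cleanly even though the images are not square.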

1 Answer


Usually, in a CNN, it is the final layers that require a fixed input size. For YOLO, these are the CSP-PAN neck and the detection head. So what happens is that the feature maps from the backbone are passed through a Spatial Pyramid Pooling - Fast (SPPF) block. This block performs a series of pooling operations and outputs a fixed-size vector. SPPF is a faster adaptation of Spatial Pyramid Pooling (SPP).
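To illustrate the idea, here is a minimal NumPy sketch of classic SPP: the feature map is max-pooled into a few fixed grids (1×1, 2×2, 4×4) and the results are concatenated, so the output length depends only on the channel count and the pyramid levels, not on the input's spatial size. This is a simplified illustration of SPP's principle, not the actual SPPF implementation used in YOLOv5:

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Spatial Pyramid Pooling sketch: max-pool a (C, H, W) feature map
    into fixed n x n grids and concatenate the results, yielding a
    fixed-length vector regardless of H and W."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        # split H and W into n roughly equal bins
        h_edges = np.linspace(0, h, n + 1).astype(int)
        w_edges = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                region = feature_map[:, h_edges[i]:h_edges[i + 1],
                                        w_edges[j]:w_edges[j + 1]]
                pooled.append(region.max(axis=(1, 2)))  # max-pool one bin
    # output length = C * sum(n*n for n in levels), independent of H, W
    return np.concatenate(pooled)

# two inputs with different spatial sizes produce identical output shapes
v1 = spp(np.random.rand(256, 12, 21))
v2 = spp(np.random.rand(256, 7, 11))
assert v1.shape == v2.shape == (256 * (1 + 4 + 16),)
```

Because the pooling grid adapts to the input size, the layers after the pyramid always see the same vector length, which is what decouples the network's later stages from the image resolution.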

Here's a paper about SPP: https://paperswithcode.com/method/spatial-pyramid-pooling