
I'm digging through the source code of torchvision's Faster R-CNN implementation and I'm running into some things I don't quite understand. Namely, assuming that I want to create a Faster R-CNN model that is not pretrained on COCO but has a backbone pre-trained on ImageNet, and then just extract the backbone, I do the following:

from torchvision.models.detection import fasterrcnn_resnet50_fpn

plain_backbone = fasterrcnn_resnet50_fpn(pretrained=False, pretrained_backbone=True).backbone.body

This is consistent with how the backbone is set up, as indicated here and here. However, when I pass an image through the model, the results don't correspond to what I would obtain if I just set up a resnet50 directly. Namely:

import numpy as np
import torch
import torchvision
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Regular resnet50, pretrained on ImageNet, without the classifier and the average pooling layer
resnet50_1 = torch.nn.Sequential(*(list(torchvision.models.resnet50(pretrained=True).children())[:-2]))
resnet50_1.eval()
# Resnet50, extracted from the Faster R-CNN, also pre-trained on ImageNet
resnet50_2 = fasterrcnn_resnet50_fpn(pretrained=False, pretrained_backbone=True).backbone.body
resnet50_2.eval()
# Loading a random image, converted to torch.Tensor, rescaled to [0, 1] (not that it matters)
image = transforms.ToTensor()(Image.open("random_images/random.jpg")).unsqueeze(0)
# Obtaining the model outputs
with torch.no_grad():
    # Output from the regular resnet50
    output_1 = resnet50_1(image)
    # Output from the resnet50 extracted from the Faster R-CNN
    output_2 = resnet50_2(image)["3"]
    # Their outputs aren't the same, which I would assume they should be
    np.testing.assert_almost_equal(output_1.numpy(), output_2.numpy())

Looking forward to your thoughts!

gorjan
  • I verified this too! Both seem to load weights from the same checkpoint but differ in the output. The `IntermediateLayerGetter` class that wraps the backbone in `resnet50_2` could be responsible for this, although I have yet to investigate further. – S V Praveen Dec 15 '20 at 13:36
  • Yeah, that's what I found confusing. From what I understood, the `IntermediateLayerGetter` is just a wrapper for easily obtaining the outputs of intermediate layers. However, let me know what you find :) – gorjan Dec 15 '20 at 13:43
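
For reference, a minimal sketch of what the `IntermediateLayerGetter` mentioned above does, assuming torchvision's internal `torchvision.models._utils.IntermediateLayerGetter` (the class the detection backbone builder uses) and a `return_layers` mapping like the one used for the FPN; it wraps a model and returns an OrderedDict of intermediate feature maps instead of a single tensor:

import torch
import torchvision
from torchvision.models._utils import IntermediateLayerGetter

# Wrap a plain resnet50 so that the outputs of layer1..layer4 are returned,
# keyed "0".."3", mirroring how the detection backbone is built
resnet = torchvision.models.resnet50(pretrained=True)
return_layers = {"layer1": "0", "layer2": "1", "layer3": "2", "layer4": "3"}
body = IntermediateLayerGetter(resnet, return_layers=return_layers)
body.eval()

with torch.no_grad():
    features = body(torch.ones(1, 3, 224, 224))
# features["3"] is the layer4 feature map, i.e. the same key that is
# indexed as resnet50_2(image)["3"] in the question
print([(name, f.shape) for name, f in features.items()])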

1 Answer


This is because fasterrcnn_resnet50_fpn uses a custom normalization layer (FrozenBatchNorm2d) instead of the default BatchNorm2d. They are very similar, but I suspect that the small numerical differences between them are causing the mismatch.

It will pass the check if you specify the same normalization layer to be used for the standard resnet:

import torch
import torchvision
from torchvision.models.detection.faster_rcnn import fasterrcnn_resnet50_fpn
import numpy as np
from torchvision.ops import misc as misc_nn_ops

# Regular resnet50, pretrained on ImageNet, without the classifier and the average pooling layer
resnet50_1 = torch.nn.Sequential(*(list(torchvision.models.resnet50(pretrained=True, norm_layer=misc_nn_ops.FrozenBatchNorm2d).children())[:-2]))
resnet50_1.eval()
# Resnet50, extracted from the Faster R-CNN, also pre-trained on ImageNet
resnet50_2 = fasterrcnn_resnet50_fpn(pretrained=False, pretrained_backbone=True).backbone.body
resnet50_2.eval()
# am too lazy to get a real image
image = torch.ones((1, 3, 224, 224))
# Obtaining the model outputs
with torch.no_grad():
    # Output from the regular resnet50
    output_1 = resnet50_1(image)
    # Output from the resnet50 extracted from the Faster R-CNN
    output_2 = resnet50_2(image)["3"]
    # Passes
    np.testing.assert_almost_equal(output_1.numpy(), output_2.numpy())
hkchengrex
  • Good catch. For the record, can you elaborate on the difference between BatchNorm and FrozenBatchNorm? Btw, I am accepting your answer now. – gorjan Dec 15 '20 at 14:34
  • @gorjan FrozenBatchNorm is implemented [here](https://github.com/pytorch/vision/blob/1a300d84da41bfffbb6a53c8b805f123d2060c0e/torchvision/ops/misc.py#L45) with pure PyTorch while BatchNorm is implemented in C++. I think the only reason that FrozenBatchNorm exists is that they want BN to stay in `eval` mode and not update its parameters, with minimum work required from the user. Any differences in the output should only be numerical and (I believe) not substantial. – hkchengrex Dec 15 '20 at 14:48
  • I just found an official explanation [here](https://github.com/facebookresearch/maskrcnn-benchmark/issues/267). – hkchengrex Dec 15 '20 at 14:52
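
To make the comment above concrete, here is a minimal sketch comparing `BatchNorm2d` in eval mode with a `FrozenBatchNorm2d` loaded from the same state; it assumes `FrozenBatchNorm2d` from `torchvision.ops.misc`, and the exact eps handling may differ between torchvision versions:

import torch
from torchvision.ops.misc import FrozenBatchNorm2d

bn = torch.nn.BatchNorm2d(8)
bn.eval()  # use the stored running statistics instead of batch statistics
frozen = FrozenBatchNorm2d(8)
# Copy the running statistics and affine parameters; strict=False because
# BatchNorm2d also stores num_batches_tracked, which FrozenBatchNorm2d ignores
frozen.load_state_dict(bn.state_dict(), strict=False)

x = torch.randn(1, 8, 16, 16)
with torch.no_grad():
    out_bn = bn(x)
    out_frozen = frozen(x)
# In eval mode both compute (x - running_mean) / sqrt(running_var + eps) * weight + bias,
# so any mismatch comes down to small numerical details such as the eps term
print((out_bn - out_frozen).abs().max())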