
I have been trying to visualize the outputs of a VGG-16 network, but the output seems to be just wrong. As far as I know, a convolution does not translate the semantic parts of the picture: if the head is in the top part of the input image, it should still be at the top after the convolution is applied. But that doesn't seem to be the case here. I used the following code to extract the intermediate layers.

import torch
import torchvision.models as tv
import matplotlib.pyplot as plt

class vgg16(torch.nn.Module):
    def __init__(self, pretrained=True):
        super(vgg16, self).__init__()
        vgg_pretrained_features = tv.vgg16(pretrained=pretrained).features
        self.layerss = torch.nn.Sequential()
        # keep the first 30 modules (up to and including relu5_3)
        for x in range(30):
            self.layerss.add_module(str(x), vgg_pretrained_features[x])
        self.layerss.eval()

    def forward(self, x):
        output = []
        for layer in self.layerss:
            x = layer(x)
            output.append(x)  # collect every intermediate feature map
        return output

model = vgg16()
output = model(img)  # img: a (1, 3, H, W) tensor
plt.imshow(output[0][0][0].detach())
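For reference, the locality claim itself can be checked in isolation (a minimal sketch, independent of the code above): a single bright pixel fed through a convolution can only produce non-zero responses within the kernel's reach, so the activation cannot move to the other side of the image.

```python
import torch

# A 3x3 convolution with padding 1; bias disabled so the background stays zero.
conv = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

img = torch.zeros(1, 1, 16, 16)
img[0, 0, 2, 2] = 1.0  # one bright pixel near the top-left corner

with torch.no_grad():
    out = conv(img)

# The response stays within one kernel radius of the input pixel:
# only rows/cols 1..3 can be non-zero, everything else is exactly zero.
print(out[0, 0, 5:, :].abs().sum().item())  # 0.0
print(out[0, 0, :, 5:].abs().sum().item())  # 0.0
```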

Here is the original picture and the output of the first channel of the first layer in the VGG network :


As you can see, the face has moved all the way down, the necklace is all the way up, and the overall structure of the picture is broken.

Ivan

1 Answer


"doesn't translate the semantic segment of the picture"

I kind of understand where you're coming from. This might be true for convolutions, but here is the thing: your model doesn't exclusively contain convolution layers. It also has max-pooling layers (namely nn.MaxPool2d). These layers can indeed disturb the spatial coherence that is initially apparent in the input image.

Combined with a rather large receptive field (which is the case for this type of CNN), the observed output is not inconceivable.
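To give a sense of scale, the receptive field can be computed layer by layer with the standard recurrence r_out = r_in + (k - 1) * j, where k is the kernel size and j the cumulative stride. A minimal sketch for the part of the VGG-16 stack kept in the question (all convolutions are 3x3 with stride 1, all pools are 2x2 with stride 2):

```python
# Layer sequence of torchvision's vgg16 up to conv5_3 (the question's
# range(30) stops before the final max-pool): two blocks of 2 convs + pool,
# two blocks of 3 convs + pool, then 3 more convs.
layers = (["conv"] * 2 + ["pool"]) * 2 \
       + (["conv"] * 3 + ["pool"]) * 2 \
       + ["conv"] * 3

r, j = 1, 1  # receptive field and cumulative stride ("jump")
for name in layers:
    k, s = (3, 1) if name == "conv" else (2, 2)
    r += (k - 1) * j
    j *= s

print(r)  # 196: a single conv5_3 unit "sees" a 196x196 patch of the input
```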

As for understanding why the result looks the way it does, that is another problem, to which I don't have the answer. The features you are extracting here should reflect higher-level information; they ultimately depend on the pretraining that was performed on the model, i.e. on which type of task and dataset the model was trained on prior to this inference.

Ivan