The main issue is that I don't understand how the upsampling works. The AlexNet architecture has the following classifier:
Dropout(),
Linear(in_features=9216, out_features=4096),
ReLU(),
Dropout(),
Linear(in_features=4096, out_features=4096),
ReLU(),
Linear(in_features=4096, out_features=1000)
The above is the standard ImageNet classifier, so the number of classes is 1000. If we change the classifier to be fully convolutional (following Fully Convolutional Networks for Semantic Segmentation) and train on VOC, we'll have 20 classes (why does the figure show 21 channels?). Images are resized to 256x256. The architecture from the paper is shown below.
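For reference, the paper's "convolutionalization" of fc6 amounts to reshaping the Linear weights into conv kernels, since 9216 = 256 x 6 x 6. A minimal sketch of the weight transplant (layer names here are my own, not from the paper):

```python
import torch
import torch.nn as nn

fc6 = nn.Linear(9216, 4096)                  # original fully connected layer
conv6 = nn.Conv2d(256, 4096, kernel_size=6)  # its convolutional counterpart

# 9216 = 256 * 6 * 6, so the fc weight matrix reshapes directly into conv kernels
with torch.no_grad():
    conv6.weight.copy_(fc6.weight.view(4096, 256, 6, 6))
    conv6.bias.copy_(fc6.bias)

# On a 6x6 input both layers compute the same scores
x = torch.randn(1, 256, 6, 6)
same = torch.allclose(conv6(x).flatten(1), fc6(x.flatten(1)), atol=1e-5)
```

On larger inputs the conv version simply slides the old fc6 over the feature map, producing a spatial grid of scores instead of a single vector.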
Using PyTorch I have the following:
import torch.nn as nn
from torchvision.models import AlexNet

class AlexNetFCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = AlexNet().features
        self.classifier = nn.Sequential(
            nn.Dropout(),
            # fc6 (9216 -> 4096) becomes a 6x6 conv over the 256 feature channels
            nn.Conv2d(256, 4096, kernel_size=6),
            nn.ReLU(),
            nn.Dropout(),
            # fc7 and the scoring layer become 1x1 convs
            nn.Conv2d(4096, 4096, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(4096, 20, kernel_size=1),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x
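As a shape sanity check, here is a sketch that rebuilds torchvision's AlexNet feature stack explicitly (so it is self-contained; weights are untrained) and pushes a 256x256 image through it:

```python
import torch
import torch.nn as nn

# torchvision's AlexNet feature extractor, written out layer by layer
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

x = torch.randn(1, 3, 256, 256)
feat = features(x)                                  # 256x256 input -> 7x7 feature map
score = nn.Conv2d(256, 4096, kernel_size=6)(feat)   # 6x6 conv: 7 - 6 + 1 = 2
```

Note that the 6x6 conv shrinks the 7x7 map to 2x2, so the classifier's output is a coarse 2x2 score grid, not 1x1 -- that coarse grid is what the paper then upsamples.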
I know that after the feature portion my feature maps are 7x7. How do I upsample back up to 256x256? Maybe nn.Bilinear, but this isn't clear to me from the paper.
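For what it's worth, nn.Bilinear is a bilinear form (x1ᵀ A x2), not an upsampler. The two usual options are fixed bilinear interpolation (F.interpolate) or the paper's learnable "backwards convolution" (nn.ConvTranspose2d, commonly initialized to bilinear weights). A sketch, assuming a hypothetical 7x7 coarse map with 21 channels (the paper scores 20 VOC classes plus background):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

score = torch.randn(1, 21, 7, 7)  # hypothetical coarse per-class score map

# Option 1: fixed bilinear interpolation straight to the target size
up1 = F.interpolate(score, size=(256, 256), mode='bilinear', align_corners=False)

# Option 2: learnable upsampling ("backwards convolution" in the FCN paper);
# output size is (in - 1) * stride + kernel = (7 - 1) * 32 + 64 = 256
deconv = nn.ConvTranspose2d(21, 21, kernel_size=64, stride=32, bias=False)
up2 = deconv(score)
```

Option 2 is what the paper trains end to end; with kernel_size = 2 * stride it can be initialized to perform exact bilinear interpolation and then fine-tuned.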