TLDR: is the correct way to extract features from an HF ViT model outputs.pooler_output or outputs.last_hidden_state[:, 0], where outputs is outputs: BaseModelOutputWithPooling = self.model(pixel_values=batch_xs)?


Given only the ViT model, it's not clear what one should do to solve a vision classification problem. I eventually converged to the following (but I am no longer sure whether it is correct or the best choice; full code is at the end):

        outputs: BaseModelOutputWithPooling = self.model(pixel_values=batch_xs)
        output: Tensor = self.dropout(outputs.last_hidden_state[:, 0])
        logits: Tensor = self.cls(output)

Intuitively it makes sense: we extract the features at the cls token position. However, once I printed all the layers of the ViTModel, I would have personally chosen a different layer, both because it sits right before the cls layer AND because its activations seem to be in a better range. I would have chosen the output of the (pooler): ViTPooler(...) layer, right after the Tanh(). Doing that results in this:

outputs: BaseModelOutputWithPooling = self.model(pixel_values=batch_xs)

outputs.pooler_output
tensor([[-0.3976, -0.8454, -0.0601,  ..., -0.2804, -0.1822,  0.1917],
        [-0.3392, -0.0248,  0.1346,  ..., -0.5822,  0.8779,  0.4147],
        [-0.2980, -0.8038, -0.1146,  ...,  0.2431, -0.0963,  0.7844],
        ...,
        [-0.1237, -0.7514,  0.7388,  ..., -0.8551,  0.1512,  0.6157],
        [ 0.5351, -0.9040,  0.0387,  ..., -0.0773,  0.2704, -0.0311],
        [ 0.2142, -0.3138,  0.0426,  ..., -0.5943,  0.2873,  0.4420]],
       grad_fn=<TanhBackward>)
outputs.last_hidden_state[:, 0]
tensor([[ 5.7313e-01, -2.1335e+00,  2.0491e-01,  ..., -1.2373e-01,
         -2.0056e-01, -4.8167e-01],
        [ 5.3309e-02, -1.6563e+00,  1.5719e+00,  ..., -1.3617e+00,
         -3.0064e-01, -2.0056e-01],
        [-2.0633e-02, -2.1370e+00,  9.9927e-01,  ..., -2.3584e+00,
          8.6123e-01, -1.2759e+00],
        ...,
        [ 3.9583e-01, -1.3500e+00,  1.7638e+00,  ..., -9.9536e-01,
          1.0843e+00, -4.4368e-01],
        [ 1.6026e+00, -6.4654e-01,  2.4882e+00,  ..., -1.0347e+00,
         -1.3160e-03, -2.4357e+00],
        [-1.2769e-02, -9.6574e-01,  1.6432e+00,  ..., -7.9090e-01,
          6.1669e-01,  3.2990e-01]], grad_fn=<SelectBackward>)

and sums for sanity checks

outputs.pooler_output.sum()
tensor(3.8430, grad_fn=<SumBackward0>)
outputs.last_hidden_state[:, 0].sum()
tensor(-6.4373e-06, grad_fn=<SumBackward0>)

and shapes

outputs.pooler_output.shape
torch.Size([25, 768])
outputs.last_hidden_state[:, 0].shape
torch.Size([25, 768])

Overall, the values of outputs.pooler_output look much better behaved to me. But my forward pass uses outputs.last_hidden_state[:, 0] for some reason.
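
As a further sanity check, here is a small sketch (reusing the outputs object from above) that compares per-tensor statistics instead of raw sums, since a near-zero sum only indicates a near-zero mean, not a bad scale:

for name, feat in [('pooler_output', outputs.pooler_output),
                   ('last_hidden_state[:, 0]', outputs.last_hidden_state[:, 0])]:
    print(f'{name}: mean={feat.mean():.4f}, std={feat.std():.4f}, '
          f'min={feat.min():.4f}, max={feat.max():.4f}')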

Which one should I have used?

Full code:


class ViTForImageClassificationUU(nn.Module):
    def __init__(self,
                 num_classes: int,
                 image_size: int,  # 224 inet, 32 cifar, 84 mi, 28 mnist, omni...
                 criterion: Optional[Union[None, Callable]] = None,
                 # Note: USL agent does criterion not model usually for me e.g nn.Criterion()
                 cls_p_dropout: float = 0.0,
                 pretrained_name: str = None,
                 vitconfig: ViTConfig = None,
                 ):
        """
        :param num_classes:
        :param pretrained_name: 'google/vit-base-patch16-224-in21k'  # what the diff with this one: "google/vit-base-patch16-224"
        """
        super().__init__()
        if vitconfig is not None:
            raise NotImplementedError
            self.vitconfig = vitconfig
            print(f'You gave a config so everyone other param given is going to be ignored.')
        elif pretrained_name is not None:
            raise NotImplementedError
            # self.vit = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')
            self.model = ViTModel.from_pretrained(pretrained_name)
            print('Make sure you did not give a vitconfig or this pretrained name will be ignored.')
        else:
            self.num_classes = num_classes
            self.image_size = image_size
            self.vitconfig = ViTConfig(image_size=self.image_size)
            self.model = ViTModel(self.vitconfig)
        assert cls_p_dropout == 0.0, 'Error, for now only p dropout for cls is zero until we figure out if we need to ' \
                                     'change all the other p dropout layers too.'
        self.dropout = nn.Dropout(cls_p_dropout)
        self.cls = nn.Linear(self.model.config.hidden_size, num_classes)
        self.criterion = None if criterion is None else criterion

    def forward(self, batch_xs: Tensor, labels: Tensor = None) -> Tensor:
        """
        Forward pass of vit. I added the "missing" cls (and dropout layer before it) to act on the first cls
        token embedding. Remaining token embeddings are ignored/not used.

        I think the feature extractor only normalizes the data for you; it doesn't even seem to turn the image into a sequence of patches, see:
        ...
        so idk why it's needed but an example using it can be found here:
            - colab https://colab.research.google.com/drive/1Z1lbR_oTSaeodv9tTm11uEhOjhkUx1L4?usp=sharing#scrollTo=cGDrb1Q4ToLN
            - blog with trainer https://huggingface.co/blog/fine-tune-vit
            - single PIL notebook https://github.com/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Quick_demo_of_HuggingFace_version_of_Vision_Transformer_inference.ipynb
        """
        outputs: BaseModelOutputWithPooling = self.model(pixel_values=batch_xs)
        output: Tensor = self.dropout(outputs.last_hidden_state[:, 0])
        logits: Tensor = self.cls(output)
        if labels is None:
            assert logits.dtype == torch.float32
            return logits  # this is what my usl agent does ;)
        else:
            raise NotImplementedError
            assert labels.dtype == torch.long
            #   loss = self.criterion(logits.view(-1, self.num_classes), labels.view(-1))
            loss = self.criterion(logits, labels)
            return loss, logits

    def get_embedding(self, batch_xs: Tensor) -> Tensor:
        """
        Get the feature embedding of the first cls token.

        Details:
        By inspecting the layers of ViTModel, the (pooler): ViTPooler(...) has a dense layer followed by a Tanh().
        From playing around, outputs.pooler_output has grad_fn=<TanhBackward>,
        so it seems to be the right one. Plus, printing
            outputs.pooler_output.sum()
            tensor(3.8430, grad_fn=<SumBackward0>)
        looks more sensible than trying to get the features for the cls position manually:
            outputs.last_hidden_state[:, 0, :].sum()
            tensor(-6.4373e-06, grad_fn=<SumBackward0>)
        which looked weird.
        """
        # outputs: BaseModelOutputWithPooling = self.model(pixel_values=batch_xs)
        outputs: BaseModelOutputWithPooling = self.model(pixel_values=batch_xs)
        feat = outputs.pooler_output
        # out = model.model(x)
        # hidden_states = out.last_hidden_state
        # # Get the CLS token's features (position 0)
        # cls_features = hidden_states[:, 0]
        # return out
        # Obtain the outputs from the base ViT model
        # outputs = self.model(pixel_values, *args, **kwargs)
        # pooled_output = outputs.pooler_output
        # image_representation = outputs.last_hidden_state[:, 0, :]
        return feat

    def _assert_its_random_model(self):
        from uutils.torch_uu import norm
        pre_trained_model = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')
        print(f'----> {norm(pre_trained_model)=}')
        print(f'----> {norm(self)=}')
        assert norm(pre_trained_model) > norm(self), f'Random models usually have a smaller weight norm but got ' \
                                                     f'{norm(pre_trained_model)=}, {norm(self)=}'


def get_vit_get_vit_model_and_model_hps(vitconfig: ViTConfig = None,
                                        num_classes: int = 5,
                                        image_size: int = 84,  # 224 inet, 32 cifar, 84 mi, 28 mnist, omni...
                                        criterion: Optional[Union[None, Callable]] = None,  # for me agent does it
                                        cls_p_dropout: float = 0.0,
                                        pretrained_name: str = None,
                                        ) -> tuple[nn.Module, dict]:
    """get vit for mi, only num_classes = 5 and image size 84 is needed. """
    model_hps: dict = dict(vitconfig=vitconfig,
                           num_classes=num_classes,
                           image_size=image_size,
                           criterion=criterion,
                           cls_p_dropout=cls_p_dropout,
                           pretrained_name=pretrained_name)
    model: nn.Module = ViTForImageClassificationUU(**model_hps)
    print("It's recommended to set args.allow_unused = True for ViT models.")
    return model, model_hps

def vit_forward_pass():
    # - for determinism
    import random
    import numpy as np
    random.seed(0)
    torch.manual_seed(0)
    np.random.seed(0)

    # - options for number of tasks/meta-batch size
    device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")

    # - get my vit model
    vitconfig: ViTConfig = ViTConfig()
    # model = ViTForImageClassificationUU(num_classes=64 + 1100, image_size=84)
    model, model_hps = get_vit_get_vit_model_and_model_hps(num_classes=64 + 1100, image_size=84)  # the vitconfig branch is not implemented, so we do not pass vitconfig
    criterion = nn.CrossEntropyLoss()
    # to device
    model.to(device)
    criterion.to(device)

    # - forward pass
    x = torch.rand(5, 3, 84, 84).to(device)
    y = torch.randint(0, 64 + 1100, (5,)).to(device)
    logits = model(x)
    loss = criterion(logits, y)
    print(f'{loss=}')

cross: https://discuss.huggingface.co/t/what-is-the-correct-way-to-create-a-feature-extractor-for-a-hugging-face-hf-vit-model/34441

Charlie Parker

1 Answer

TLDR: is the correct way to extract features from an HF ViT model outputs.pooler_output or outputs.last_hidden_state[:, 0], where outputs is outputs: BaseModelOutputWithPooling = self.model(pixel_values=batch_xs)?

As per my understanding, I would go with the second option, outputs.last_hidden_state[:, 0]. Indeed, as I interpret it, using a model as a feature extractor means freezing its body's weights, removing any head it may have, and retrieving the hidden states of the final layer. Then, especially for classification tasks, it is very common to just use the hidden state associated with the [CLS] token (this token nomenclature is also used by ViT, not only by NLP models; see the original implementation).
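
To make that concrete, here is a minimal sketch (assuming the same HF classes you are already using, and the in21k checkpoint from your own comments) of a frozen ViTModel used as a feature extractor, reading off the [CLS] hidden state:

import torch
from transformers import ViTModel

# frozen ViT body used purely as a feature extractor (a sketch, not your exact setup)
vit = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')
vit.eval()
for p in vit.parameters():
    p.requires_grad = False

pixel_values = torch.rand(2, 3, 224, 224)  # dummy batch; normally produced by the image feature extractor/processor
with torch.no_grad():
    outputs = vit(pixel_values=pixel_values)
cls_features = outputs.last_hidden_state[:, 0]  # shape (batch_size, hidden_size), e.g. (2, 768)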

The ViT paper states the following:

[...] z_0^0 = x_class (i.e. the [CLS] token), whose state at the output of the Transformer Encoder (z_L^0) serves as the image representation y. Both during pre-training and fine-tuning, a classification head is attached to z_L^0. The classification head is implemented by a MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.

Therefore, imo, the pooled output you get from the HF ViTPooler is part of the 2-layer (1 hidden layer, 1 output layer) classification head that is used at pre-training time on the image classification task the authors considered in the paper. (The other self-supervised task mentioned in the paper, masked patch prediction, does not use such a pre-training setting, as can be seen in ViTForMaskedImageModeling.) Said differently, for me, ViTModel instances with add_pooling_layer=True (which effectively rely on ViTPooler) serve the sole purpose of reproducing the pre-training setting and are indeed not used anywhere in the downstream task classes.
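
As a consequence, if the pooler is only there to mirror the pre-training head, it can simply be dropped; a small sketch, assuming add_pooling_layer is accepted as a keyword argument by ViTModel in your transformers version:

from transformers import ViTConfig, ViTModel

# build the backbone without the ViTPooler (assumes the `add_pooling_layer` kwarg is available)
model = ViTModel(ViTConfig(image_size=224), add_pooling_layer=False)
# with no pooler, outputs.pooler_output is None and only last_hidden_state is produced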

If, besides the HF implementation, we were also to consider the implementations in the original Google repo, here

[...]
x = self.encoder(name='Transformer', **self.transformer)(x, train=train)

if self.classifier == 'token':
    x = x[:, 0]
[...]

if self.representation_size is not None:
    x = nn.Dense(features=self.representation_size, name='pre_logits')(x)
    x = nn.tanh(x)
else:
    x = IdentityLayer(name='pre_logits')(x)

if self.num_classes:
    x = nn.Dense(
        features=self.num_classes,
        name='head',
        kernel_init=nn.initializers.zeros,
        bias_init=nn.initializers.constant(self.head_bias_init))(x)
return x

and in torchvision, __init__

heads_layers: OrderedDict[str, nn.Module] = OrderedDict()
if representation_size is None:
    heads_layers["head"] = nn.Linear(hidden_dim, num_classes)
else:
    heads_layers["pre_logits"] = nn.Linear(hidden_dim, representation_size)
    heads_layers["act"] = nn.Tanh()
    heads_layers["head"] = nn.Linear(representation_size, num_classes)

self.heads = nn.Sequential(heads_layers)

and forward method

def forward(self, x: torch.Tensor):
    [...]
    x = self.encoder(x)

    # Classifier "token" as used by standard language architectures
    x = x[:, 0]

    x = self.heads(x)

    return x

we would see that this idea seems to be valid.
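
Translated back to the HF objects from your question, the same pattern (take the position-0 token, then a single linear head, no pooler) would look roughly like this sketch (the sizes are just illustrative):

import torch
from torch import nn
from transformers import ViTConfig, ViTModel

config = ViTConfig(image_size=84)
backbone = ViTModel(config, add_pooling_layer=False)  # no pooler, mirroring the heads above
head = nn.Linear(config.hidden_size, 5)  # e.g. a 5-way classifier (illustrative)

x = torch.rand(2, 3, 84, 84)
cls_state = backbone(pixel_values=x).last_hidden_state[:, 0]  # the "classifier token", i.e. x[:, 0]
logits = head(cls_state)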

Finally, imo, this is further reinforced by the fact that if you just run

from transformers import AutoModel
model = AutoModel.from_pretrained('google/vit-base-patch16-224')
  • model is a ViTModel instance (and therefore includes the pooling layer by default)

  • you'll get a warning stating that

    Some weights of the model checkpoint at google/vit-base-patch16-224 were not used when initializing ViTModel: ['classifier.weight', 'classifier.bias']

    which is in line with the fact that we're not using any of the ViT models meant for downstream tasks, which come with a final (single linear layer) classification head on top of the encoder output (e.g. ViTForImageClassification)

but especially

  • you'll get a warning stating that

    Some weights of ViTModel were not initialized from the model checkpoint at google/vit-base-patch16-224 and are newly initialized: ['vit.pooler.dense.weight', 'vit.pooler.dense.bias']

which I would interpret as evidence that the pooler hidden states (i.e. the pooler_output that you're considering) are not part of the pre-trained checkpoint.
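
For what it's worth, this can also be checked programmatically; a small sketch using the standard output_loading_info flag of from_pretrained:

from transformers import ViTModel

model, loading_info = ViTModel.from_pretrained('google/vit-base-patch16-224',
                                               output_loading_info=True)
print(loading_info['missing_keys'])     # expect the pooler weights here (newly initialized)
print(loading_info['unexpected_keys'])  # expect the classifier head weights here (present in the checkpoint but unused)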

I've found the following very useful on this topic (but not with reference to ViT necessarily):

All this said, the difference is subtle in my opinion. It would be nice if you could get an answer in the HF forum, too.

amiola
  • I'm confused, what is `mdl.pooler_output`? – Charlie Parker Apr 06 '23 at 17:55
  • Not sure I get your question; that's the output of the [`ViTPooler`](https://github.com/huggingface/transformers/blob/68287689f2f0d8b7063c400230b3766987abf18d/src/transformers/models/vit/modeling_vit.py#L600), which is a classification head put on top of the [CLS] token's final hidden state. – amiola Apr 06 '23 at 20:40