
Pre-trained ViT (Vision Transformer) models are usually trained on 224x224 or 384x384 images, but I have to fine-tune a custom ViT model (all the layers of ViT plus some additional layers) on 640x640 images. How should I handle the positional embeddings in this case? When I load the pre-trained weights from ViT-base-patch16-384 (available on Hugging Face) into my model using load_state_dict(), I get the following error:

size mismatch for pos_embed: copying a param with shape torch.Size([1, 577, 768]) from checkpoint, the shape in current model is torch.Size([1, 1601, 768]).

This error is expected: with 16x16 patches, the original model has 24x24 = 576 patch positions plus one position for the [CLS] token (577 in total), while my model has 40x40 = 1600 patch positions plus the [CLS] token (1601 in total), since the image size is now 640x640.

From what I have read, researchers use interpolation to solve this problem. My question is: if I use bicubic interpolation, does the following code resolve the issue I am facing? Will it properly interpolate the positional embeddings?

import torch
a = torch.rand(1, 577, 768)  # say this is the original pos_embed
a_temp = a.unsqueeze(0)  # shape of a_temp is (1, 1, 577, 768)
# made 4D because torch's bicubic interpolation only accepts 4D input

b = torch.nn.functional.interpolate(a_temp, size=[1601, 768], mode='bicubic')
# shape of b is now (1, 1, 1601, 768)

b = b.squeeze(0)  # back to 3D
# shape of b is now (1, 1601, 768)
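
Or, instead of the above, should I separate the [CLS] position, treat the remaining 576 positions as a 24x24 grid, and bicubically interpolate only that grid up to 40x40? Something like the rough sketch below (resize_pos_embed is just a helper name I made up, and this is only my own understanding of the grid-based approach, so it may well be wrong):

import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=24, new_grid=40):
    # pos_embed: (1, 1 + old_grid*old_grid, dim); the first token is [CLS]
    cls_tok, grid_tok = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    # (1, old_grid*old_grid, dim) -> (1, dim, old_grid, old_grid)
    grid_tok = grid_tok.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    # interpolate over the spatial grid only, not over the embedding dimension
    grid_tok = F.interpolate(grid_tok, size=(new_grid, new_grid),
                             mode='bicubic', align_corners=False)
    # (1, dim, new_grid, new_grid) -> (1, new_grid*new_grid, dim)
    grid_tok = grid_tok.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, grid_tok], dim=1)  # (1, 1 + new_grid*new_grid, dim)

new_pos_embed = resize_pos_embed(torch.rand(1, 577, 768))
print(new_pos_embed.shape)  # torch.Size([1, 1601, 768])

Which of the two approaches is the correct way to handle the positional embeddings before calling load_state_dict()?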

N.B. I followed the video to implement the original ViT, and then added some additional layers on top.
