I want to add a classification layer in PyTorch on top of the Hugging Face ViLT transformer so that I can classify inputs into my own set of text labels.
Normally, ViLT takes an (image, question) pair and outputs the answer to the question after a forward pass.
I want to turn this into a classification task instead of a text generation task. I have a fixed set of labels, and I want ViLT to pick the label with the highest probability of being the answer to the given question.
I'm completely new to transformers and have very little idea of how this can be achieved. Can someone please help me?
I checked this Medium blog post but couldn't make sense of it.
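To make the goal concrete, here is a minimal sketch of the setup described above: an encoder that returns a pooled vector, followed by a linear layer that scores each label. The names `ViltLabelClassifier` and `DummyEncoder` and the three example labels are hypothetical. The dummy encoder is only a stand-in so the snippet runs without downloading weights. It mimics the `pooler_output` field that `transformers.ViltModel` returns, and it is not ViLT itself.

```python
from types import SimpleNamespace

import torch
from torch import nn


class ViltLabelClassifier(nn.Module):
    """Encoder plus a linear head that scores a fixed set of labels (hypothetical name)."""

    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, **inputs):
        # The encoder is assumed to return an object with a `pooler_output`
        # tensor of shape (batch, hidden_size), as ViltModel does.
        pooled = self.encoder(**inputs).pooler_output
        return self.head(pooled)  # (batch, num_labels) logits


class DummyEncoder(nn.Module):
    """Hypothetical stand-in for ViltModel so the sketch runs offline."""

    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(8, hidden_size)

    def forward(self, features):
        return SimpleNamespace(pooler_output=torch.tanh(self.proj(features)))


labels = ["cat", "dog", "bird"]  # example labels, not from any dataset
model = ViltLabelClassifier(DummyEncoder(16), hidden_size=16, num_labels=len(labels))

features = torch.randn(4, 8)           # fake batch of 4 examples
targets = torch.tensor([0, 2, 1, 0])   # fake label indices

logits = model(features=features)
loss = nn.functional.cross_entropy(logits, targets)  # training objective
probs = logits.softmax(dim=-1)                        # per-label probabilities
predicted = [labels[i] for i in probs.argmax(dim=-1).tolist()]
```

With the real model, the encoder would presumably be something like `ViltModel.from_pretrained("dandelin/vilt-b32-mlm")`, with `hidden_size` taken from `encoder.config.hidden_size` and the inputs produced by `ViltProcessor` from the image and question. It may also be worth checking `ViltForQuestionAnswering` in `transformers`, which already places a classification head over a fixed answer set on top of ViLT.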