I am using Hugging Face's BERTweet implementation (https://huggingface.co/docs/transformers/model_doc/bertweet). I want to encode some tweets and forward them for further processing (predictions). The problem is that when I try to encode a relatively long sentence, the model raises an error.
Example:
import torch
from transformers import AutoModel, AutoTokenizer
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)  # enable automatic tweet normalization
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms "
input_ids = torch.tensor([tokenizer.encode(line)])
print(input_ids)
with torch.no_grad():
    features = bertweet(input_ids)
CONSOLE OUTPUT:
RuntimeError: The expanded size of the tensor (136) must match the existing size (130) at non-singleton dimension 1. Target sizes: [1, 136]. Tensor sizes: [1, 130]
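I suspect the 130 in the error is the model's positional embedding size rather than BERT's usual 512. A minimal check, assuming these are the relevant attributes (my guess, not confirmed):

print(bertweet.config.max_position_embeddings)  # I expect 130 for vinai/bertweet-base
print(tokenizer.model_max_length)  # the tokenizer's own limit, if one is set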
However, if you change the line to:
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
then the model encodes the sentence successfully. Is that expected behaviour? I know that BERT has a maximum input length of 512 tokens, and as far as I know BERTweet is basically a fine-tuned BERT. Would simply trimming longer sentences be an acceptable solution to my problem? Thanks in advance.
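If trimming is acceptable, I assume I could do it at tokenization time rather than cutting the raw text myself, something like the sketch below (max_length=128 is my guess based on the 130 in the error above, not a confirmed limit):

# Sketch: truncate at tokenization time; max_length=128 is an assumption on my part
encoded = tokenizer(line, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    features = bertweet(**encoded)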