
I am using Hugging Face's BERTweet implementation (https://huggingface.co/docs/transformers/model_doc/bertweet). I want to encode some tweets and forward them for further processing (predictions). The problem is that when I try to encode a relatively long sentence, the model raises an error.

Example:

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)  # enable automatic normalization of tweets


line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry: SC has first two presumptive cases of coronavirus , DHEC confirms  "
input_ids = torch.tensor([tokenizer.encode(line)]) 

print(input_ids)
with torch.no_grad():
    features = bertweet(input_ids)

CONSOLE OUTPUT:

RuntimeError: The expanded size of the tensor (136) must match the existing size (130) at non-singleton dimension 1.  Target sizes: [1, 136].  Tensor sizes: [1, 130]

However, if you change the line to:

line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

then the model encodes the sentence successfully. Is that expected behaviour? I know that BERT has a maximum of 512 tokens per sequence, and BERTweet is basically a fine-tuned BERT. Is it a good idea to just trim longer sentences, and would that be an acceptable solution to my problem? Thanks in advance.

Petar

1 Answer


The original BERT model does indeed have a 512-token limit per sequence, but a custom model such as BERTweet can have different properties, depending on what the authors decide when pretraining from scratch. In your case the authors appear to have capped the maximum sequence length at 130 positions, likely because of the 280-character tweet limit and the much higher cost of training on longer sequences (pretraining on about 120 million lines of data with 128-token sequences costs roughly $600 per epoch, while with 512-token sequences the price rises to around $10K per epoch).
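You can check the cap yourself by inspecting the loaded config and tokenizer (a quick sanity check, reusing the objects from your snippet; the values in the comments are what I would expect for vinai/bertweet-base):

print(bertweet.config.max_position_embeddings)  # expected: 130 position embeddings
print(tokenizer.model_max_length)               # expected: 128 usable tokens per sequence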

To answer your question: you will need to truncate the inputs so that they fit within the model's maximum length. Although the model has 130 position embeddings, two of them are reserved (RoBERTa-style position offset), so the practical limit is 128 tokens, which is also the tokenizer's model_max_length. I would recommend simply passing the relevant arguments to your tokenizer, such as max_length=128, truncation=True, padding='max_length', etc. (there are plenty of examples online); then it is all done automatically, and every tokenized sequence ends up with the same length of 128.
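For example, something along these lines should work (a minimal sketch, reusing the same checkpoint and a shortened version of your input line):

line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

# Truncate/pad every sequence to 128 tokens and return PyTorch tensors directly.
encoded = tokenizer(
    line,
    max_length=128,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

with torch.no_grad():
    features = bertweet(**encoded)  # the attention mask tells the model to ignore the padding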

Note that this will discard every token in the sequence that appears after the 128-token limit. If you want to preserve that information, you may consider splitting each line that exceeds the limit into several shorter chunks, as sketched below.
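A rough sketch of such chunking could look like this (purely illustrative, with fixed-size windows of at most 126 sub-word tokens so the two special tokens still fit within the 128-token budget; you may prefer to split on sentence boundaries instead):

ids = tokenizer.encode(line, add_special_tokens=False)
chunk_size = 126  # leave room for the <s> and </s> special tokens
chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

with torch.no_grad():
    for chunk in chunks:
        input_ids = torch.tensor([tokenizer.build_inputs_with_special_tokens(chunk)])
        features = bertweet(input_ids)  # process each chunk separately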

One thing that bothers me, however, is: how on earth did you find a tweet that is over 130 tokens long? That does not seem possible unless it is full of single-character words, which is more noise than data. So you may want to re-evaluate your data pre-processing pipeline to make sure all inputs are cleaned properly. Otherwise, if you are feeding in longer combinations of tweets or something similar, you may have to stop and consider whether BERTweet is really the best model for the job.