Token indices sequence length warning while using pretrained Roberta model for sentiment analysis

Question

I am using a pretrained RoBERTa model to compute sentiment scores and categories for my dataset. I set the maximum length to 512 with truncation, but I still get the warning shown below. What is going wrong? This is my code:

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax
model = "j-hartmann/sentiment-roberta-large-english-3-classes"
tokenizer = AutoTokenizer.from_pretrained(model, model_max_length=512,truncation=True)
automodel = AutoModelForSequenceClassification.from_pretrained(model)

The warning that I am getting here:

Token indices sequence length is longer than the specified maximum sequence length for this model (627 > 512). Running this sequence through the model will result in indexing errors

If you just want to disable that warning, use `transformers.utils.logging.set_verbosity_error()` — Ritwik, Jul 09 '23 at 19:50

score 1 · Accepted Answer · edited May 25 '23 at 13:48

You have not shared the code where you call the tokenizer on your inputs, so I'll use my own example to show how to do this.

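from transformers import RobertaTokenizer

# model_path is assumed here to be the checkpoint from the question
model_path = "j-hartmann/sentiment-roberta-large-english-3-classes"
tokenizer = RobertaTokenizer.from_pretrained(model_path, model_max_length=512)

Example usage:

text = "hello " * 513  # example text that tokenizes to more than 512 tokens

# truncation and padding are requested here, in the tokenizer call itself
encoded = tokenizer(text, max_length=512, truncation=True, padding='max_length')

# tokenizer.encode_plus() or tokenizer.encode() accept the same parameters and also return 512 tokens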

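With these parameters, the tokenizer returns exactly max_length tokens for any string: it pads when the input has fewer than max_length tokens and truncates when it has more. The warning in the question most likely appears because truncation=True was passed only to from_pretrained(); truncation takes effect when it is passed in the tokenizer call itself.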

Note: max_length cannot be greater than 512 for a RoBERTa model, and that limit includes the special tokens <s> and </s>.
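
To make the padding and truncation behaviour concrete, here is a rough pure-Python sketch of what those two options do to a list of token ids. It is not the library's implementation: pad_or_truncate is a hypothetical helper, and the special-token ids (<s>=0, <pad>=1, </s>=2) are assumed RoBERTa defaults that you should check against your own tokenizer.

```python
from typing import Dict, List

# Assumed RoBERTa special-token ids (<s>=0, <pad>=1, </s>=2); verify with your tokenizer.
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2


def pad_or_truncate(content_ids: List[int], max_length: int = 512) -> Dict[str, List[int]]:
    """Hypothetical helper imitating truncation=True with padding='max_length'."""
    if max_length < 2:
        raise ValueError("max_length must leave room for <s> and </s>")
    # Truncate the content so that the two special tokens still fit.
    kept = content_ids[: max_length - 2]
    input_ids = [BOS_ID] + kept + [EOS_ID]
    # Attention mask: 1 for real tokens, 0 for padding.
    attention_mask = [1] * len(input_ids)
    # Right-pad up to max_length.
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# 627 made-up ids, echoing the length reported in the warning
long_out = pad_or_truncate(list(range(100, 727)))
short_out = pad_or_truncate([100, 101, 102])
print(len(long_out["input_ids"]), len(short_out["input_ids"]))  # 512 512
```

Both inputs come out at exactly 512 ids, which is why the model no longer sees an over-long sequence once truncation is requested at call time.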
