I get a strange error when trying to encode question-answer pairs for BERT using the encode_plus method from the Transformers library.
I am using data from this Kaggle competition. Given a question title, question body and answer, the model must predict 30 values (regression problem). My goal is to get the following encoding as input to BERT:
[CLS] question_title question_body [SEP] answer [SEP]
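To make the target layout concrete, here is a plain-string sketch of the encoding I am after (the texts below are made up; the real inputs come from train.csv):

```python
# Made-up example texts standing in for the Kaggle data
question_title = "Why is the sky blue?"
question_body = "I have read about scattering but never understood it."
answer = "Shorter wavelengths are scattered more strongly by air molecules."

# First segment: title and body joined with a space; second segment: the answer
first_segment = question_title + " " + question_body
encoding = "[CLS] " + first_segment + " [SEP] " + answer + " [SEP]"
print(encoding)
```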
However, when I instantiate the tokenizer with

tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")

and encode only the second row of train.csv as follows:
inputs = tokenizer.encode_plus(
    df_train["question_title"].values[1] + " " + df_train["question_body"].values[1],  # first sequence to be encoded
    df_train["answer"].values[1],  # second sequence to be encoded
    add_special_tokens=True,  # [CLS] and 2x [SEP]
    max_len=512,
    pad_to_max_length=True
)
I get the following error:
Token indices sequence length is longer than the specified maximum sequence length for this model (46 > 512). Running this sequence through the model will result in indexing errors
It claims that the token indices sequence is longer than the specified maximum sequence length, but that is clearly not the case here: 46 is not greater than 512.
This happens for several rows of df_train. Am I doing something wrong here?
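For reference, this is the crude check I used to convince myself the inputs are short, without involving the tokenizer at all (whitespace splitting can only underestimate BERT's subword count, so it is just a lower bound; the helper name and example strings are mine):

```python
def rough_token_lower_bound(first_segment, second_segment):
    # Whitespace word count as a lower bound on BERT's subword token count,
    # plus 3 for the special tokens: [CLS] and two [SEP]s.
    words = len(first_segment.split()) + len(second_segment.split())
    return words + 3

# Made-up segments standing in for one row of the Kaggle data
first = "Why is the sky blue? I never understood scattering."
second = "Shorter wavelengths scatter more strongly."
print(rough_token_lower_bound(first, second))  # → 17
```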