I get a strange error when trying to encode question-answer pairs for BERT using the encode_plus method from the Transformers library.
I am using data from this Kaggle competition. Given a question title, question body and answer, the model must predict 30 values (regression problem). My goal is to get the following encoding as input to BERT:
[CLS] question_title question_body [SEP] answer [SEP]
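To make the target layout concrete, here is a plain-string sketch of the encoding I am after (the texts below are made up; the real inputs come from train.csv):

```python
# Made-up example texts standing in for the Kaggle data
question_title = "Why is the sky blue?"
question_body = "I have read about scattering but never understood it."
answer = "Shorter wavelengths are scattered more strongly by air molecules."

# First segment: title and body joined with a space; second segment: the answer
first_segment = question_title + " " + question_body
encoding = "[CLS] " + first_segment + " [SEP] " + answer + " [SEP]"
print(encoding)
```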
However, when I instantiate the tokenizer with

tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")

and encode only the second row of train.csv as follows:
inputs = tokenizer.encode_plus(
    df_train["question_title"].values[1] + " " + df_train["question_body"].values[1],  # first sequence to be encoded
    df_train["answer"].values[1],  # second sequence to be encoded
    add_special_tokens=True,  # [CLS] and 2x [SEP]
    max_len=512,
    pad_to_max_length=True
)
I get the following error:
Token indices sequence length is longer than the specified maximum sequence length for this model (46 > 512). Running this sequence through the model will result in indexing errors
It claims that the token indices sequence is longer than the specified maximum sequence length, but that is clearly not the case here: 46 is not greater than 512.
This happens for several rows of df_train. Am I doing something wrong here?
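For reference, this is the crude check I used to convince myself the inputs are short, without involving the tokenizer at all (whitespace splitting can only underestimate BERT's subword count, so it is just a lower bound; the helper name and example strings are mine):

```python
def rough_token_lower_bound(first_segment, second_segment):
    # Whitespace word count as a lower bound on BERT's subword token count,
    # plus 3 for the special tokens: [CLS] and two [SEP]s.
    words = len(first_segment.split()) + len(second_segment.split())
    return words + 3

# Made-up segments standing in for one row of the Kaggle data
first = "Why is the sky blue? I never understood scattering."
second = "Shorter wavelengths scatter more strongly."
print(rough_token_lower_bound(first, second))  # → 17
```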