10

I am trying to do text classification using a pretrained BERT model. I trained the model on my dataset, and now I am in the testing phase. I know that BERT can only take up to 512 tokens, so I wrote an if condition to check the length of the test sentence in my dataframe. If it is longer than 512 words, I split the sentence into sequences of 512 words each and then apply the tokenizer encode. The length of each sequence is 512, but after tokenizer encoding the length becomes 707 and I get this error:

The size of tensor a (707) must match the size of tensor b (512) at non-singleton dimension 1

Here is the code I used for the previous steps:

import math

import numpy as np
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

pred = []
if len(test_sentence_in_df.split()) > 512:
    # Split the test sentence into chunks of 512 whitespace-separated words.
    n = math.ceil(len(test_sentence_in_df.split()) / 512)
    for i in range(n):
        if i == (n - 1):
            print(i)
            test_sentence = ' '.join(test_sentence_in_df.split()[i * 512:])
        else:
            print("i in else", str(i))
            test_sentence = ' '.join(test_sentence_in_df.split()[i * 512:(i + 1) * 512])
            # print(len(test_sentence.split()))  # here the length is 512 words

        tokenized_sentence = tokenizer.encode(test_sentence)
        input_ids = torch.tensor([tokenized_sentence]).cuda()
        print(len(tokenized_sentence))  # here the length is 707 tokens
        with torch.no_grad():
            output = model(input_ids)
            label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
        pred.append(label_indices)

print(pred)
Mee

2 Answers

11

This is because BERT uses word-piece tokenization. When a word is not in the vocabulary, it is split into its word pieces. For example, if the word playing is not in the vocabulary, it can be split into play, ##ing. This increases the number of tokens in a given sentence after tokenization. You can specify certain parameters to get a fixed-length tokenization:

tokenized_sentence = tokenizer.encode(test_sentence, padding=True, truncation=True, max_length=50, add_special_tokens=True)
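
If you want to keep the whole text instead of truncating it, a minimal sketch (assuming the question's `test_sentence_in_df` string and a standard transformers tokenizer) is to chunk the token ids rather than the words, so each chunk plus the [CLS]/[SEP] special tokens stays within BERT's 512-token limit:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

print(len(test_sentence_in_df.split()))   # number of whitespace-separated words
token_ids = tokenizer.encode(test_sentence_in_df, add_special_tokens=False)
print(len(token_ids))                     # usually larger: word pieces, not words

chunk_size = 512 - 2                      # leave room for [CLS] and [SEP]
chunks = [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]
for chunk in chunks:
    ids = tokenizer.build_inputs_with_special_tokens(chunk)  # re-add [CLS]/[SEP]
    assert len(ids) <= 512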

Ashwin Geet D'Sa
  • If the `encode()` function doesn't work, then `batch_encode_plus()` definitely works. – Ashwin Geet D'Sa Oct 13 '20 at 08:36
  • Just as a side note: this error is very likely to appear if a monolingual BERT model is used on another language ;) – Chiarcos Sep 17 '21 at 08:32
  • @AshwinGeetD'Sa I'm using batch_encode_plus() and I still get this error. This is the code I use: `tokenizer.batch_encode_plus( df.abstract.values, add_special_tokens=True, return_attention_mask=True, padding='longest', max_length=256, return_tensors='pt' )` – mah65 Sep 27 '21 at 08:50
  • What is the error? – Ashwin Geet D'Sa Sep 27 '21 at 09:57
  • This does not show us how to solve the issue within the pipeline() setup of transformers. Passing these args to AutoTokenizer.from_pretrained() doesn't affect the behavior when you call the pipeline. – lrthistlethwaite Oct 15 '21 at 18:03
  • For the pipeline, use this: https://stackoverflow.com/a/70626497/11170350 – Talha Anwar Jun 15 '23 at 07:48
8

If you are running a transformer model with HuggingFace, there is a chance that one of the input sentences is longer than 512 tokens. Either truncate or split your sentences. I suspect the shorter sentences are padded to 512 tokens.
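
A minimal sketch of the truncation route, assuming a plain list of sentence strings and a standard HuggingFace tokenizer (the names and example data below are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
sentences = ["a short sentence", "a much longer sentence " * 300]  # illustrative data

encoded = tokenizer(
    sentences,
    truncation=True,       # drop everything beyond max_length
    max_length=512,
    padding='longest',     # pad shorter sentences only up to the longest in the batch
    return_tensors='pt',
)
print(encoded['input_ids'].shape)  # never more than 512 columns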

yangliu2