
I am running a sentence transformer model and trying to truncate my tokens, but it doesn't appear to be working. My code is

from transformers import AutoModel, AutoTokenizer
model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
    
text_tokens = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
text_embedding = model(**text_tokens)["pooler_output"]

I keep getting the following warning:

Token indices sequence length is longer than the specified maximum sequence length 
for this model (909 > 512). Running this sequence through the model will result in 
indexing errors

Why is setting truncation=True not truncating my text to the desired length?

GSA
  • Which version of transformers are you using? Please give the output of `transformers-cli env`. – kkgarg Aug 19 '21 at 19:26

1 Answer


You need to pass the max_length parameter when calling the tokenizer, like below:

text_tokens = tokenizer(text, padding=True, max_length=512, truncation=True, return_tensors="pt")

Reason:

truncation=True without a max_length argument truncates to the maximum input length the model accepts, as reported by the tokenizer.

For this tokenizer, that limit is int(1e30), i.e. 1000000000000000019884624838656, a sentinel value meaning no limit was configured, so nothing is actually truncated. You can verify this by printing tokenizer.model_max_length.

According to the Hugging Face documentation on truncation:

True or 'only_first' truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None).
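The fallback described above can be illustrated with a small self-contained sketch. This is a hypothetical simplification of the Hugging Face logic, not the library's actual code; the function names are made up for illustration:

```python
# Sentinel used by the tokenizer when no maximum length is configured.
# int(1e30) == 1000000000000000019884624838656, matching tokenizer.model_max_length here.
VERY_LARGE_INTEGER = int(1e30)

def resolve_truncation_length(max_length=None, model_max_length=VERY_LARGE_INTEGER):
    # An explicit max_length wins; otherwise fall back to the model's reported limit.
    return max_length if max_length is not None else model_max_length

def truncate(token_ids, max_length=None):
    # Cut the sequence to the resolved limit.
    limit = resolve_truncation_length(max_length)
    return token_ids[:limit]

tokens = list(range(909))           # pretend these are the 909 token ids from the warning
print(len(truncate(tokens)))        # no max_length: limit is the huge sentinel, so 909
print(len(truncate(tokens, 512)))   # explicit max_length=512: sequence is cut to 512
```

With no max_length, the slice limit is so large that the 909-token sequence passes through untouched, which is exactly why the warning still fires; passing max_length=512 is what actually cuts it.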

kkgarg