
I'm using the transformers FeatureExtractionPipeline like this:

from transformers import pipeline, LongformerTokenizer, LongformerModel

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

nlu_feature_pipeline = pipeline(task="feature-extraction", model=model, tokenizer=tokenizer)

However, the pipeline doesn't seem to apply truncation to ensure that no sequence is longer than 4096 tokens, resulting in this warning:

Token indices sequence length is longer than the specified maximum sequence length for this model (8912 > 4096). Running this sequence through the model will result in indexing errors

Is there any way I can enable truncation in the pipeline? Or is it possible to tokenize beforehand and then feed the tokenized input into the pipeline?

Lukas Tilmann
    The "feature-extraction" pipeline does not support truncation because it would be contrary to its purpose. You have to truncate or split the input by yourself. – cronoik Feb 20 '21 at 07:00
  • Hmm... I don't think I understand "contrary to its purpose" – Att Righ Nov 07 '21 at 21:54
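Following cronoik's suggestion to split the input yourself, a minimal sketch of windowing a long token-id sequence into chunks that fit the model (the helper name and `stride` parameter are illustrative, not part of transformers):

```python
# Illustrative helper for splitting long inputs, per cronoik's comment.
def chunk_token_ids(token_ids, max_length=4096, stride=0):
    """Split a list of token ids into windows of at most max_length,
    with consecutive windows overlapping by `stride` tokens."""
    if max_length <= stride:
        raise ValueError("max_length must exceed stride")
    step = max_length - stride
    return [token_ids[i:i + max_length]
            for i in range(0, len(token_ids), step)]
```

Each chunk can then be fed to the model separately and the per-chunk features pooled however suits the downstream task.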

0 Answers