I'm using the transformers FeatureExtractionPipeline like this:
from transformers import pipeline, LongformerTokenizer, LongformerModel

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
nlu_feature_pipeline = pipeline(task="feature-extraction", model=model, tokenizer=tokenizer)
However, the pipeline doesn't seem to truncate inputs to the model's maximum of 4096 tokens, so for long documents I get this warning:
Token indices sequence length is longer than the specified maximum sequence length for this model (8912 > 4096). Running this sequence through the model will result in indexing errors
Is there any way to enable truncation in the pipeline? Or, alternatively, can I tokenize the text beforehand (with truncation) and then feed the encoded inputs into the pipeline or model?