I've fine-tuned a Hugging Face BERT model for Named Entity Recognition based on 'bert-base-uncased'. I perform inference like this:
from transformers import pipeline

# model_folder contains the fine-tuned model and tokenizer files
ner_pipeline = pipeline('token-classification', model=model_folder, tokenizer=model_folder)
out = ner_pipeline(text, aggregation_strategy='simple')
I want to obtain results on very long texts, and since I know about the 512-token maximum capacity for both training and inference, I split my texts into smaller chunks before passing them to the ner_pipeline.
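Roughly, the splitting I have in mind looks like this (a sketch, not my exact code: sent_tokenize from NLTK is just one possible sentence splitter, and max_chars is an arbitrary character budget):

from nltk.tokenize import sent_tokenize  # requires: nltk.download('punkt')

def chunk_text(text, max_chars=1000):
    # Greedily pack sentences into chunks of at most max_chars characters.
    # The character budget only approximates the real 512-token limit.
    chunks, current = [], ''
    for sentence in sent_tokenize(text):
        candidate = f'{current} {sentence}'.strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

out = [ner_pipeline(chunk, aggregation_strategy='simple') for chunk in chunk_text(text)]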
But how do I split the text without tokenizing it myself just to check the length of each chunk? I want each chunk to be as long as possible, but without exceeding the 512-token maximum, since no predictions would be computed on whatever gets cut off.
Is there a way to know whether the texts I'm feeding in exceed the 512-token maximum?
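For reference, the only approach I can think of is to run the tokenizer over each text myself, something like the sketch below (assuming the tokenizer saved in model_folder), but that means tokenizing everything twice:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_folder)

# input_ids includes the special [CLS] and [SEP] tokens
n_tokens = len(tokenizer(text)['input_ids'])
if n_tokens > tokenizer.model_max_length:  # 512 for bert-base-uncased
    print(f'Too long: {n_tokens} tokens')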