
I am trying to mask named entities in text using a RoBERTa-based model. The suggested way to use the model is via a Hugging Face pipeline, but I find that rather slow. Using a pipeline on text data also prevents me from using my GPU for computation, as the text cannot be put onto the GPU.

Because of this, I decided to put the model on the GPU, tokenize the text myself (using the same tokenizer I pass to the pipeline), put the tokens on the GPU, and then pass them to the model. This works, but the outputs of the model used directly like this differ significantly from the outputs of the pipeline. I can't find a reason for this, nor a way to fix it.

I tried reading through the token classification pipeline source code, but I couldn't find any difference between my usage and what the pipeline does.

Examples of code which produce different results:

  1. Suggested usage in the model card:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

ner_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
classifier = pipeline("ner", model=model, tokenizer=ner_tokenizer, framework='pt')
out = classifier(dataset['text'])

'out' is now a list of lists of dictionaries, one inner list per string in 'dataset['text']', where each dictionary holds the information for one recognized entity token in that string.
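For reference, a single entry typically has the following keys (the values here are made up for illustration):

# One dictionary per recognized token, as returned by the "ner" pipeline
# without an aggregation strategy; the values below are illustrative only.
{'entity': 'I-PER', 'score': 0.9991, 'index': 1, 'word': '▁John', 'start': 0, 'end': 4}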

  2. My custom usage:
import torch

TORCH_DEVICE = torch.device('cuda')  # the GPU device referred to above

text_batch = dataset['text']
encodings_batch = ner_tokenizer(text_batch, padding="max_length", truncation=True, max_length=128, return_tensors="pt")
input_ids = encodings_batch['input_ids'].to(TORCH_DEVICE)
model = model.to(TORCH_DEVICE)  # the model is moved to the GPU, as described above
outputs = model(input_ids)[0]
label_ner_ids = outputs.argmax(dim=2).to('cpu')

'label_ner_ids' is now a two-dimensional tensor whose elements are the predicted label ids for each token, so label_ner_ids[i, j] is the label id for the j-th token of the i-th string in 'text_batch'. The token labels here differ from the output of the pipeline usage.
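To compare these ids with the pipeline output, they can be mapped back to label strings through the model config (id2label is a standard attribute on Hugging Face token classification models):

# Map predicted label ids back to label strings; note that the rows also
# cover special and padding positions, since the batch was padded to max_length.
id2label = model.config.id2label
predicted_labels = [[id2label[int(idx)] for idx in row] for row in label_ner_ids]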

  • I have since bypassed my issue by defining my own class MyTokenClassificationPipeline and overriding the inherited preprocess function. For the preprocess function I just copied the source code and added a step to move all tensors to the CUDA device (see the sketch below). – Bunnyrabbit Jan 27 '23 at 12:43
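A minimal sketch of that workaround might look like the following (the exact preprocess signature varies between transformers versions, so treat this as illustrative rather than exact):

import torch
from transformers import TokenClassificationPipeline

class MyTokenClassificationPipeline(TokenClassificationPipeline):
    def preprocess(self, sentence, offset_mapping=None, **kwargs):
        # Reuse the stock preprocessing, then move every tensor to the GPU.
        model_inputs = super().preprocess(sentence, offset_mapping=offset_mapping, **kwargs)
        return {k: v.to('cuda') if isinstance(v, torch.Tensor) else v
                for k, v in model_inputs.items()}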

1 Answer


The pipeline supports processing on the GPU. All you need to do is pass a device (a GPU index such as 0, or a torch.device):

from transformers import pipeline

model_id = "xlm-roberta-large-finetuned-conll03-english"

classifier = pipeline("ner", model=model_id, device=TORCH_DEVICE, framework='pt')  # e.g. device=0 for the first GPU
out = classifier(dataset['text'])
– cronoik