I am trying to use an XLM-RoBERTa model that I have fine-tuned for token classification, but no matter what I do, the output always comes back with all the tokens stuck together, like:
[{'entity_group': 'LABEL_0',
'score': 0.4824247,
'word': 'Thedogandthecatwenttothehouse',
'start': 0,
'end': 325}]
What can I do to get the words properly separated in the output, as happens with other models such as BERT?
I have tried running the training with add_prefix_space=True, but it does not seem to have any effect:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained('MMG/xlm-roberta-large-ner-spanish', add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english", use_cache=None, num_labels=NUM_LABELS, ignore_mismatched_sizes=True)  # NUM_LABELS is defined earlier in my script
pipe = pipeline(task="token-classification", model=model.to("cpu"), binary_output=True, tokenizer=tokenizer, aggregation_strategy="average")
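For reference, here is a minimal check I can run to see what the tokenizer itself produces (just a sketch; the sample sentence is made up):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('MMG/xlm-roberta-large-ner-spanish', add_prefix_space=True)

# XLM-RoBERTa uses a SentencePiece vocabulary, which normally marks the start
# of each word with the '▁' character, so printing the raw tokens shows
# whether word boundaries survive tokenization at all.
print(tokenizer.tokenize("The dog and the cat went to the house"))

My (possibly wrong) understanding is that the aggregation step needs some notion of word boundaries to split the grouped output back into words, so I suspect those boundaries are getting lost somewhere along the way.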
Thanks a lot in advance for your help.