I am trying to use an XLM-RoBERTa model that I have fine-tuned for token classification, but no matter what I do, the output always comes back with all the tokens stuck together, like:
[{'entity_group': 'LABEL_0',
'score': 0.4824247,
'word': 'Thedogandthecatwenttothehouse',
'start': 0,
'end': 325}]
What can I do to get the words properly separated in the output, as happens with other models such as BERT?
I have tried running the training with add_prefix_space=True, but it does not seem to have any effect:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained('MMG/xlm-roberta-large-ner-spanish', add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english", use_cache=None, num_labels=NUM_LABELS, ignore_mismatched_sizes=True)  # NUM_LABELS is defined earlier in my script
pipe = pipeline(task="token-classification", model=model.to("cpu"), binary_output=True, tokenizer=tokenizer, aggregation_strategy="average")
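For reference, here is a minimal check I can run to see what the tokenizer itself produces (just a sketch; the sample sentence is made up):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('MMG/xlm-roberta-large-ner-spanish', add_prefix_space=True)

# XLM-RoBERTa uses a SentencePiece vocabulary, which normally marks the start
# of each word with the '▁' character, so printing the raw tokens shows
# whether word boundaries survive tokenization at all.
print(tokenizer.tokenize("The dog and the cat went to the house"))

My (possibly wrong) understanding is that the aggregation step needs some notion of word boundaries to split the grouped output back into words, so I suspect those boundaries are getting lost somewhere along the way.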
Thanks a lot in advance for your help.