Do I need to pre-tokenize the text first before using HuggingFace's RobertaTokenizer? (Different undersanding)

Question

I feel confused when using the Roberta tokenizer in Huggingface.

>>> tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
>>> x = tokenizer.tokenize("The tiger is ___ (big) than the dog.")
['The', 'Ġtiger', 'Ġis', 'Ġ___', 'Ġ(', 'big', ')', 'Ġthan', 'Ġthe', 'Ġdog', '.']
>>> x = tokenizer.tokenize("The tiger is ___ ( big ) than the dog.")
['The', 'Ġtiger', 'Ġis', 'Ġ___', 'Ġ(', 'Ġbig', 'Ġ)', 'Ġthan', 'Ġthe', 'Ġdog', '.']
>>> x = tokenizer.encode("The tiger is ___ (big) than the dog.")
[0, 20, 23921, 16, 2165, 36, 8527, 43, 87, 5, 2335, 4, 2]
>>> x = tokenizer.encode("The tiger is ___ ( big ) than the dog.")
[0, 20, 23921, 16, 2165, 36, 380, 4839, 87, 5, 2335, 4, 2]
>>>

Question: (big) and ( big ) have different tokenization results, which result in different token id as well. Which one I should use? Does it mean that I should pre-tokenize the input first to make it ( big ) and go for RobertaTokenization? Or it doesn't really matter?

Secondly, it seems BertTokenizer has no such confusion:

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> x = tokenizer.tokenize("The tiger is ___ (big) than the dog.")
['the', 'tiger', 'is', '_', '_', '_', '(', 'big', ')', 'than', 'the', 'dog', '.']
>>> x = tokenizer.tokenize("The tiger is ___ ( big ) than the dog.")
['the', 'tiger', 'is', '_', '_', '_', '(', 'big', ')', 'than', 'the', 'dog', '.']
>>>

BertTokenizer gives me the same results using the wordpieces.

Any thoughts to help me better understand the RobertaTokenizer, which I know is using Byte-Pair Encoding?

score 10 · Accepted Answer · answered Jun 17 '20 at 09:57

Hugingface's Transformers are designed such that you are not supposed to do any pre-tokenization.

RoBERTa uses SentecePiece which has lossless pre-tokenization. I.e., when you have a tokenized text, you should always be able to say how the text looked like before tokenization. The Ġ (which is ▁, a weird Unicode underscore in the original SentecePiece) says that there should be a space when you detokenize. As a consequence big and ▁big end up as different tokens. Of course, in this particular context, it does not make much sense because it is obviously still the same word, but this the price you pay for lossless tokenization and also how RoBERTa was trained.

BERT uses WordPiece, which does not suffer from this problem. On the other hand, the mapping between the original string and the tokenized text is not as straightforward (which might be inconvenient, e.g., when you want to highlight something in a user-generated text).

Do I need to pre-tokenize the text first before using HuggingFace's RobertaTokenizer? (Different undersanding)

1 Answers1