transformers AutoTokenizer.tokenize introducing extra characters

Question

I am using HuggingFace transformers AutoTokenizer to tokenize small segments of text. However this tokenization is splitting incorrectly in the middle of words and introducing # characters to the tokens. I have tried several different models with the same results.

Here is an example of a piece of text and the tokens that were created from it.

CTO at TLR Communications Pty Ltd
['[CLS]', 'CT', '##O', 'at', 'T', '##LR', 'Communications', 'P', '##ty', 'Ltd', '[SEP]']

And here is the code I am using to generate the tokens

tokenizer = AutoTokenizer.from_pretrained("tokenizer_bert.json")
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))

Transformers use `subword tokenization`. You can read about it online. — kkgarg, Nov 11 '21 at 20:39

score 4 · Accepted Answer · answered Nov 13 '21 at 06:48

This is not an error but a feature. BERT and other transformers use WordPiece tokenization algorithm that tokenizes strings into either: (1) known words; or (2) "word pieces" for unknown words in the tokenizer vocabulary.

In your examle, words "CTO", "TLR", and "Pty" are not in the tokenizer vocabulary, and thus WordPiece splits them into subwords. E.g. the first subword is "CT" and another part is "##O" where "##" denotes that the subword is connected to the predecessor.

This is a great feature that allows to represent any string.

transformers AutoTokenizer.tokenize introducing extra characters

1 Answers1