I am using the Hugging Face transformers AutoTokenizer to tokenize small segments of text. However, the tokenization splits in the middle of words and introduces ## characters into the tokens, which looks incorrect to me. I have tried several different models with the same results.
Here is an example piece of text and the tokens that were created from it:
CTO at TLR Communications Pty Ltd
['[CLS]', 'CT', '##O', 'at', 'T', '##LR', 'Communications', 'P', '##ty', 'Ltd', '[SEP]']
And here is the code I am using to generate the tokens:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tokenizer_bert.json")

sequence = "CTO at TLR Communications Pty Ltd"
# Round-trip: encode to IDs, decode back to text, then tokenize that text
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
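For what it's worth, here is a self-contained sketch of the same test using a stock pretrained checkpoint instead of my local tokenizer file (bert-base-cased is just one example; the other BERT models I tried behave the same way):

from transformers import AutoTokenizer

# Stock checkpoint, used here only to show the behavior is not specific
# to my local tokenizer_bert.json file
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "CTO at TLR Communications Pty Ltd"
encoded = tokenizer.encode(sequence)  # token IDs, including [CLS] and [SEP]
tokens = tokenizer.tokenize(tokenizer.decode(encoded))
print(tokens)
# For me this also prints pieces prefixed with ##, e.g. 'CT', '##O', 'P', '##ty'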