I am trying to tokenize text with HuggingFace's tokenizers library by loading a local vocab file.
import os
from tokenizers import BertWordPieceTokenizer

vocab_path = '....'  # local directory containing vocab.txt
tokenizer = BertWordPieceTokenizer(os.path.join(vocab_path, "vocab.txt"), lowercase=False)
text = 'The Quick Brown fox'
output = tokenizer.encode(text)
output.tokens
Out[1326]: ['[CLS]', '[UNK]', '[UNK]', '[UNK]', 'fox', '[SEP]']
The vocab I am using contains only lowercase entries. However, I want to tokenize the text with the original capitalization preserved.
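If I check the vocab directly (a quick sanity check, assuming the same tokenizer object created above with lowercase=False), the cased forms do seem to be missing, which is presumably why they come out as [UNK]:

tokenizer.token_to_id("the")   # returns an integer id
tokenizer.token_to_id("The")   # returns None -- not in the lowercase vocab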
Desired Output:
['[CLS]', 'The', 'Quick', 'Brown', 'fox', '[SEP]']
When I use lowercase=True, the words are recognized:
tokenizer = BertWordPieceTokenizer(os.path.join(vocab_path, "vocab.txt"), lowercase=True)
output = tokenizer.encode(text)
output.tokens
Out[1344]: ['[CLS]', 'the', 'quick', 'brown', 'fox', '[SEP]']
How do I get the desired output while still using the lowercase vocab?