
I am trying to tokenize text by loading a local vocab file with the Hugging Face `tokenizers` library.

import os
from tokenizers import BertWordPieceTokenizer

vocab_path = '....'  ## have a local vocab path
tokenizer = BertWordPieceTokenizer(os.path.join(vocab_path, "vocab.txt"), lowercase=False)
text = 'The Quick Brown fox'
output = tokenizer.encode(text)
output.tokens
Out[1326]: ['[CLS]', '[UNK]', '[UNK]', '[UNK]', 'fox', '[SEP]']

The vocab I am using is entirely lowercase. However, I want to tokenize the text with capitalization preserved.

Desired Output:

['[CLS]', 'The', 'Quick', 'Brown', 'fox', '[SEP]']

When I use lowercase=True, it recognizes the words:

tokenizer = BertWordPieceTokenizer(os.path.join(vocab_path, "vocab.txt"), lowercase=True)

output = tokenizer.encode(text)

output.tokens
Out[1344]: ['[CLS]', 'the', 'quick', 'brown', 'fox', '[SEP]']

How do I get the desired output using the lowercase vocab?


1 Answer


The vocab you are using belongs to an uncased model, which means it was trained on lowercased text (if you open vocab.txt you can see that every entry is lowercase). A WordPiece tokenizer can only emit tokens that exist in its vocab, so a capitalized word like "The" (and its subword pieces) has no entry and is mapped to [UNK].
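The lookup failure can be illustrated with a deliberately simplified sketch (real WordPiece also tries "##" subword pieces, which is omitted here; the vocab contents are made up for the example):

```python
# Simplified whole-word lookup showing why a lowercase-only vocab
# maps capitalized words to [UNK]. Real WordPiece additionally tries
# subword pieces ("##..."); that step is omitted for clarity.
lowercase_vocab = {"the", "quick", "brown", "fox"}  # uncased vocab (assumed)
cased_vocab = {"The", "Quick", "Brown", "fox"}      # cased vocab (assumed)

def lookup(words, vocab):
    # Wrap the sequence in BERT's special tokens, replacing
    # out-of-vocabulary words with [UNK].
    return ["[CLS]"] + [w if w in vocab else "[UNK]" for w in words] + ["[SEP]"]

words = "The Quick Brown fox".split()
print(lookup(words, lowercase_vocab))
# ['[CLS]', '[UNK]', '[UNK]', '[UNK]', 'fox', '[SEP]']
print(lookup(words, cased_vocab))
# ['[CLS]', 'The', 'Quick', 'Brown', 'fox', '[SEP]']
```

With the lowercase vocab only "fox" matches, reproducing the [UNK] output in the question; with a cased vocab the capitalization survives.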

If you want case-sensitive tokenization, use a model trained for that purpose; such models usually have "cased" in their name (e.g. bert-base-cased). You can find many of them on the Hugging Face Hub.
