I am trying to use a tokenizer from the Hugging Face tokenizers library. However, I do not have a vocab.

from tokenizers import BertWordPieceTokenizer, CharBPETokenizer, ByteLevelBPETokenizer
from tokenizers import Tokenizer
from tokenizers.models import BPE

text = 'the quick brown fox jumped over the lazy dog !!!'
tokenizer = CharBPETokenizer()
print(tokenizer)
# Tokenizer(vocabulary_size=0, model=BPE, unk_token=<unk>, suffix=</w>, dropout=None,
#           lowercase=False, unicode_normalizer=None, bert_normalizer=True,
#           split_on_whitespace_only=False)

tokenizer = Tokenizer(BPE())
out = tokenizer.encode(text)
print(out.tokens)  # []

According to https://github.com/huggingface/tokenizers/blob/main/bindings/python/py_src/tokenizers/implementations/char_level_bpe.py, when no vocab is supplied, CharBPETokenizer should fall back to a plain Tokenizer(BPE()). With an empty vocabulary and no unk_token, every input character is unknown and gets dropped, which would explain the empty output above.

I think this is a missing-vocab issue. Can someone point me to where I can get a default vocab for BertWordPieceTokenizer, CharBPETokenizer, ByteLevelBPETokenizer, SentencePieceUnigramTokenizer, and BaseTokenizer?
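
For context, one way to get a non-empty vocabulary without downloading anything is to train the tokenizer on your own text. This is a minimal sketch; 'corpus.txt' is a placeholder path, not a file shipped with the library:

from tokenizers import CharBPETokenizer

# Train a fresh BPE vocabulary on a local corpus.
# 'corpus.txt' is a placeholder -- point it at any plain-text file you have.
tokenizer = CharBPETokenizer()
tokenizer.train(files=['corpus.txt'], vocab_size=5000, min_frequency=2)

out = tokenizer.encode('the quick brown fox jumped over the lazy dog !!!')
print(out.tokens)  # non-empty now, with CharBPE's '</w>' end-of-word suffix

The alternative, loading an existing model's vocab files, is sketched after the comments below.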

  • Can you clarify what your expected tokenization output should look like? As for the mentioned classes, there is no "default vocabulary", unless you are loading from a pre-trained model, in which case they refer to generated BPE vocabularies. – dennlinger Jun 15 '22 at 13:54
  • Thanks. I loaded vocab from other models and it worked. I thought they might have a default vocabulary. – Palash Jhamb Jun 23 '22 at 14:56
  • It would be great if you could self-answer this post then with a solution on how to load the actual model-specific vocabularies. – dennlinger Jun 24 '22 at 09:37
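
Following up on the comments, here is a hedged sketch of loading model-specific vocabularies from the Hugging Face Hub. The repo IDs and filenames used (bert-base-uncased's vocab.txt, roberta-base's vocab.json and merges.txt) are the standard published files for those models, but swap in whichever model you actually need:

from huggingface_hub import hf_hub_download
from tokenizers import BertWordPieceTokenizer, ByteLevelBPETokenizer

# Download the vocab files of existing pre-trained models from the Hub.
# Repo IDs and filenames are examples; adjust to the model you want.
bert_vocab = hf_hub_download(repo_id='bert-base-uncased', filename='vocab.txt')
wp_tokenizer = BertWordPieceTokenizer(bert_vocab)

roberta_vocab = hf_hub_download(repo_id='roberta-base', filename='vocab.json')
roberta_merges = hf_hub_download(repo_id='roberta-base', filename='merges.txt')
bpe_tokenizer = ByteLevelBPETokenizer(roberta_vocab, roberta_merges)

print(wp_tokenizer.encode('the quick brown fox').tokens)

Note the asymmetry: BertWordPieceTokenizer takes a single WordPiece vocab file, while ByteLevelBPETokenizer needs both the vocab and the merges file.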
