
I would like to use the WordLevel encoding method to build my own wordlist; the trained model is saved as a vocab.json under the my_word2_token folder. The code is below and it works.

import pandas as pd
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer
from transformers import BertTokenizerFast
from tokenizers.pre_tokenizers import Whitespace
import os

tokenizer = Tokenizer(models.WordLevel())
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordLevelTrainer(vocab_size=1400, special_tokens=special_tokens)

tokenizer.train(files=["./data/material.txt"], trainer=trainer)
# The tokenizer for this corpus is now trained; check the vocabulary size
print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))
# Save the trained tokenizer
tokenizer.model.save('./my_word2_token/')

But when I try to use BartTokenizer or BertTokenizer to load my vocab.json, it does not work. With BertTokenizer, the tokenized results are all [UNK] (screenshot omitted). As for BartTokenizer, it raises:

ValueError: Calling BartTokenizer.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.
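
For reference, the failing calls were presumably something like the following (a hypothetical reconstruction; the exact paths are assumptions):

from transformers import BartTokenizer, BertTokenizer

# Presumed failing calls (hypothetical; exact paths not shown in the question)
bert_tokenizer = BertTokenizer.from_pretrained('./my_word2_token/vocab.json')  # loads, but everything tokenizes to [UNK]
bart_tokenizer = BartTokenizer.from_pretrained('./my_word2_token/vocab.json')  # raises the ValueError above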

Could anyone help me out?

I would like to use the WordLevel encoding method to establish my own wordlists and tokenize them using WordLevel encoding rather than BPE encoding.


1 Answer


BartTokenizer and BertTokenizer are classes of the transformers library, and you can't directly load the tokenizer you generated with them. Instead, the transformers library offers a wrapper called PreTrainedTokenizerFast that can load it:

from tokenizers import models, normalizers, pre_tokenizers, trainers, Tokenizer

# WordLevel needs an unk_token, otherwise out-of-vocabulary words cannot be encoded
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordLevelTrainer(vocab_size=1400, special_tokens=special_tokens)

tokenizer.train(files=["material.txt"], trainer=trainer)

from transformers import PreTrainedTokenizerFast

# Wrap the trained tokenizers.Tokenizer so it behaves like any transformers tokenizer
transformer_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
print(transformer_tokenizer("马云 祖籍浙江嵊县,生于浙江杭州,中国大陆企业家,中国共产党党员。").input_ids)

Output:

[0, 0, 0, 0, 0, 261, 0, 0, 0, 56, 0, 0, 261, 0, 221, 0, 345, 133, 28, 0, 357, 0, 448, 0, 345, 133, 127, 0, 377, 377, 0, 5]
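
If you want to see which positions fell back to the unknown token, you can map the ids back to tokens. A quick sketch using the wrapper created above:

# Inspect the tokens behind the ids; with the special_tokens order above, id 0 is [UNK]
ids = transformer_tokenizer("马云 祖籍浙江嵊县,生于浙江杭州,中国大陆企业家,中国共产党党员。").input_ids
print(transformer_tokenizer.convert_ids_to_tokens(ids))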

P.S.: Please note that I added the unk_token parameter:

tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
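
If you also want to persist the wrapped tokenizer and reload it later (the original goal of saving under my_word2_token), a minimal sketch, assuming a recent transformers version, would be:

# Save the wrapped tokenizer (writes tokenizer.json plus config files); directory name taken from the question
transformer_tokenizer.save_pretrained('./my_word2_token/')

# Reload it later without retraining
from transformers import PreTrainedTokenizerFast
reloaded_tokenizer = PreTrainedTokenizerFast.from_pretrained('./my_word2_token/')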