I would like to use the WordLevel encoding method to build my own vocabulary; the script below trains the model and saves it as a vocab.json under the my_word2_token folder. The code works.
from tokenizers import models, normalizers, pre_tokenizers, trainers, Tokenizer

# WordLevel needs an unk_token so that out-of-vocabulary words map to [UNK]
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordLevelTrainer(vocab_size=1400, special_tokens=special_tokens)
tokenizer.train(files=["./data/material.txt"], trainer=trainer)

# The tokenizer for this corpus is now trained; check the vocabulary size
print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))

# Save the trained tokenizer (writes vocab.json into the folder)
tokenizer.model.save('./my_word2_token/')
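For reference, encoding directly with the trained tokenizer object behaves as expected (the sample sentence here is just an illustration):

# Sanity check: unknown words fall back to [UNK] instead of raising
encoding = tokenizer.encode("this is a test sentence")
print(encoding.tokens)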
But when I try to use BartTokenizer or BertTokenizer to load my vocab.json, it does not work. In particular, with BertTokenizer the tokenized results are all [UNK], as below.
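This is roughly the call I used (the path matches the training script above; the test sentence is just an example):

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('./my_word2_token/vocab.json')
# Every token comes back as [UNK]
print(bert_tokenizer.tokenize("this is a test sentence"))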
As for BartTokenizer, it raises:
ValueError: Calling BartTokenizer.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.
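That error comes from a call along these lines (again a sketch; the path is the one from my script):

from transformers import BartTokenizer

# Passing the single vocab.json file triggers the ValueError quoted above
bart_tokenizer = BartTokenizer.from_pretrained('./my_word2_token/vocab.json')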
Could anyone help me out?
To restate: I want to build my own vocabulary and tokenize with WordLevel encoding, not BPE encoding.