
I want to train an XLNet language model from scratch. First, I have trained a tokenizer as follows:

from tokenizers import ByteLevelBPETokenizer

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
# Customize training
tokenizer.train(files=["data.txt"], min_frequency=2, special_tokens=[  # default vocab_size
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
tokenizer.save_model("tokenizer model")

Finally, I will have two files in the given directory:

merges.txt
vocab.json

I have defined the following config for the model:

from transformers import XLNetConfig, XLNetModel
config = XLNetConfig()
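
One thing worth noting at this step (a sketch, not part of the original question): a model built from a bare `XLNetConfig()` is randomly initialized, and its `vocab_size` should match the tokenizer you trained. The `vocab_size=32000` below is an illustrative value, not something the question specifies:

```python
from transformers import XLNetConfig, XLNetModel

# Randomly initialized XLNet -- no pretrained weights are downloaded.
# vocab_size is an assumed value; it must match your trained tokenizer.
config = XLNetConfig(vocab_size=32000)
model = XLNetModel(config)
```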

Now, I want to recreate my tokenizer in transformers:

from transformers import XLNetTokenizerFast

tokenizer = XLNetTokenizerFast.from_pretrained("tokenizer model")

However, the following error appears:

File "dfgd.py", line 8, in <module>
    tokenizer = XLNetTokenizerFast.from_pretrained("tokenizer model")
  File "C:\Users\DSP\AppData\Roaming\Python\Python37\site-packages\transformers\tokenization_utils_base.py", line 1777, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'tokenizer model'. Make sure that:

- 'tokenizer model' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'tokenizer model' is the correct path to a directory containing relevant tokenizer files

What should I do?

  • Is the `tokenizer model` just a replacement for the full path? – cronoik Feb 20 '21 at 15:58
  • pretrained_model_name_or_path (`str` or `os.PathLike`, optional), [here](https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained) – Shijith Feb 20 '21 at 16:06
  • The tokenizer model is a replacement for the full path of the folder in which the two files are saved. –  Feb 20 '21 at 16:11
  • When this folder only contains those two files, you can not use the `from_pretrained` method as it requires a `tokenizer_config.json`. Add this and it will work directly. @BNoor – cronoik Feb 21 '21 at 07:40

1 Answer


Instead of

tokenizer = XLNetTokenizerFast.from_pretrained("tokenizer model")

load the two files directly with the `tokenizers` library. `from_pretrained` expects additional files (such as `tokenizer_config.json`) in that directory, so it cannot load a folder that contains only `vocab.json` and `merges.txt`:

from tokenizers.implementations import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer(
    "tokenizer model/vocab.json",
    "tokenizer model/merges.txt",
)
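
If you later need a `transformers`-style tokenizer object (for example, to pass to a `Trainer`), one option is to save the trained tokenizer as a single `tokenizer.json` file and wrap it with `PreTrainedTokenizerFast`. The sketch below retrains on a toy in-memory corpus so it runs end to end; in your case you would load `"tokenizer model/vocab.json"` and `"tokenizer model/merges.txt"` instead, and the file name `tokenizer.json` is just an illustrative choice:

```python
from tokenizers import ByteLevelBPETokenizer
from transformers import PreTrainedTokenizerFast

# Toy corpus stands in for your data.txt so the example is self-contained.
bpe = ByteLevelBPETokenizer()
bpe.train_from_iterator(
    ["hello world"] * 4,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
bpe.save("tokenizer.json")  # single-file format that transformers can read

# Wrap it so downstream transformers code sees the usual tokenizer API.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
    bos_token="<s>",
    eos_token="</s>",
)
ids = tokenizer.encode("hello world")
```

The special-token kwargs are needed because `PreTrainedTokenizerFast` does not infer them from the file; without them, padding and masking in later training steps would fail.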