
I'm trying to load a local tokenizer using:

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained(r'file path\tokenizer')

However, this gives me the following error:

OSError: Can't load tokenizer for 'file path\tokenizer'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'file path\tokenizer' is the correct path to a directory containing all relevant files for a RobertaTokenizerFast tokenizer.

The file directory contains both a merges.txt and a vocab.json file for the tokenizer, so I am not sure how to resolve the issue.

Appreciate any help!


1 Answer


You need to point to the directory that contains those files, not to one of the files itself. Get the path to that directory and define the tokenizer as you did above, but pass that directory path as a string in place of r'file path\tokenizer':

tokenizer = RobertaTokenizerFast.from_pretrained('path_to_directory')

RobertaTokenizerFast expects to find vocab.json, merges.txt, and tokenizer.json in that directory, so make sure you have everything it requires. Note that you can also point to these files individually by passing the vocab_file, merges_file, and tokenizer_file arguments. See the docs for further information.
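
As a minimal sketch of the file-argument route (path_to_directory is a placeholder for your own path):

from transformers import RobertaTokenizerFast

# Build the fast tokenizer directly from the individual files
tokenizer = RobertaTokenizerFast(
    vocab_file='path_to_directory/vocab.json',
    merges_file='path_to_directory/merges.txt',
)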

  • Thanks, the path to the file directory is fine; however, I only have the vocab and merges files. These were the result of saving a ByteLevelBPETokenizer in an earlier step: `from tokenizers import ByteLevelBPETokenizer; tokenizer = ByteLevelBPETokenizer(); tokenizer.train(files=paths, vocab_size=30_522, min_frequency=2, special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])` – Jon Jun 03 '23 at 12:41
  • I think you may just need to point to the files directly then, via the constructor arguments: `tokenizer = RobertaTokenizerFast(vocab_file='path_to_directory/vocab.json', merges_file='path_to_directory/merges.txt')` – doine Jun 03 '23 at 17:05
  • Otherwise, wrap the trained tokenizer in a `RobertaTokenizerFast` and save it explicitly with `save_pretrained()`, then point `from_pretrained()` at that directory. – doine Jun 03 '23 at 17:16
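
Putting the thread together, a rough end-to-end sketch (the `paths` variable and path_to_directory are placeholders from the question, and the standard RoBERTa special tokens are assumed):

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizerFast

# Train the byte-level BPE tokenizer, as in the question
bpe = ByteLevelBPETokenizer()
bpe.train(files=paths, vocab_size=30_522, min_frequency=2,
          special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

# Write vocab.json and merges.txt to a directory
bpe.save_model('path_to_directory')

# Wrap the trained files in a RobertaTokenizerFast and save them in the
# layout that from_pretrained() expects (tokenizer.json, config files, etc.)
tokenizer = RobertaTokenizerFast(
    vocab_file='path_to_directory/vocab.json',
    merges_file='path_to_directory/merges.txt',
)
tokenizer.save_pretrained('path_to_directory')

# Loading by directory path should now work
tokenizer = RobertaTokenizerFast.from_pretrained('path_to_directory')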