6

I am new to PyTorch and have recently been trying to work with Transformers. I am using the pretrained tokenizers provided by HuggingFace.
I can download and run them successfully. But if I try to save a tokenizer and load it again, an error occurs.
If I use AutoTokenizer.from_pretrained to download a tokenizer, it works:

[1]:    tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
        text = "Hello there"
        enc = tokenizer.encode_plus(text)
        enc.keys()

Out[1]: dict_keys(['input_ids', 'attention_mask'])

But if I save it using tokenizer.save_pretrained("distilroberta-tokenizer") and try to load it locally, it fails:

[2]:    tmp = AutoTokenizer.from_pretrained('distilroberta-tokenizer')


---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    238                 resume_download=resume_download,
--> 239                 local_files_only=local_files_only,
    240             )

/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, local_files_only)
    266         # File, but it doesn't exist.
--> 267         raise EnvironmentError("file {} not found".format(url_or_filename))
    268     else:

OSError: file distilroberta-tokenizer/config.json not found

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-25-3bd2f7a79271> in <module>
----> 1 tmp = AutoTokenizer.from_pretrained("distilroberta-tokenizer")

/opt/conda/lib/python3.7/site-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    193         config = kwargs.pop("config", None)
    194         if not isinstance(config, PretrainedConfig):
--> 195             config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
    196 
    197         if "bert-base-japanese" in pretrained_model_name_or_path:

/opt/conda/lib/python3.7/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    194 
    195         """
--> 196         config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
    197 
    198         if "model_type" in config_dict:

/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    250                 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\n\n"
    251             )
--> 252             raise EnvironmentError(msg)
    253 
    254         except json.JSONDecodeError:

OSError: Can't load config for 'distilroberta-tokenizer'. Make sure that:

- 'distilroberta-tokenizer' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'distilroberta-tokenizer' is the correct path to a directory containing a config.json file

It's saying config.json is missing from the directory. On checking the directory, I get this list of files:

[3]:    !ls distilroberta-tokenizer

Out[3]: merges.txt  special_tokens_map.json  tokenizer_config.json  vocab.json

I know this problem has been posted before, but none of the suggested solutions seems to work. I have also tried to follow the docs but still can't make it work.
Any help would be appreciated.

ferty567

2 Answers

6

There is currently an issue under investigation which affects only the AutoTokenizer class, not the underlying tokenizer classes such as RobertaTokenizer. For example, the following should work:

from transformers import RobertaTokenizer
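# The concrete tokenizer class reads the tokenizer files directly and does not need a config.json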

tokenizer = RobertaTokenizer.from_pretrained('YOURPATH')

To work with AutoTokenizer, you also need to save the config so that it can be loaded offline:

from transformers import AutoTokenizer, AutoConfig
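# AutoTokenizer needs a config.json in the target directory to infer
# the tokenizer class, so save the model config alongside the tokenizer.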

tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
config = AutoConfig.from_pretrained('distilroberta-base')

tokenizer.save_pretrained('YOURPATH')
config.save_pretrained('YOURPATH')

tokenizer = AutoTokenizer.from_pretrained('YOURPATH')

I recommend either using different paths for the tokenizer and the model, or keeping a copy of your model's config.json. Any modifications you apply to your model are stored in the config.json that is created during model.save_pretrained(), and that file will be overwritten when you afterwards save the tokenizer to the same path as described above (i.e. you won't be able to load your modified model with the tokenizer's config.json).
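
For example, a minimal sketch of the separate-path approach (the directory names below are placeholders):

from transformers import AutoConfig, AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('distilroberta-base')
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
config = AutoConfig.from_pretrained('distilroberta-base')

# ... modify/fine-tune the model here ...

# Separate directories: the model's config.json is not overwritten
# when the tokenizer and its config are saved.
model.save_pretrained('my-model')
tokenizer.save_pretrained('my-tokenizer')
config.save_pretrained('my-tokenizer')

model = AutoModel.from_pretrained('my-model')
tokenizer = AutoTokenizer.from_pretrained('my-tokenizer')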

cronoik
  • This works, but I have one more question. While using tokenizer_obj.save_pretrained("path"), the log shows that it saved five files: 1. tokenizer_config.json, 2. special_tokens_map.json, 3. vocab.txt, 4. added_tokens.json, 5. tokenizer.json. However, added_tokens.json is missing from that location. Can you point me to any documentation on save_pretrained() for tokenizers? I can only find it for models. – Srikar Manthatti Sep 28 '21 at 23:08
  • 2
    This is the [link](https://huggingface.co/transformers/internal/tokenization_utils.html#transformers.tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained) to the documentation. added_tokens.json is only saved when you have added tokens manually. I assume they are not checking which files were actually written and just print the list of files that could be created. @SrikarManthatti – cronoik Sep 29 '21 at 05:36
  • this seems close, but it still doesn't work if I take the saved model and tokenizer and instantiate them in a HuggingFace pipeline, like this: `pipe = pipeline("ner", model, token)` --- I still get `Exception: Impossible to guess which tokenizer to use.` – Marc Maxmeister Sep 09 '22 at 16:28
  • Oh! I figured it out. I need to say `pipe = pipeline("ner", model=model, tokenizer=tokenizer)` for it to work. The `impossible to guess` error is misleading here. – Marc Maxmeister Sep 09 '22 at 17:22
5

I see several issues in your code, which I have listed below:

  1. distilroberta-tokenizer is a directory containing the vocab, config, and other files. Please make sure to create this directory first.

  2. Using AutoTokenizer works only if this directory contains config.json and NOT tokenizer_config.json. So, please rename this file.

I modified your code below and it works.

import os

dir_name = "distilroberta-tokenizer"

# Create the directory first if it does not exist
if not os.path.isdir(dir_name):
    os.mkdir(dir_name)

tokenizer.save_pretrained(dir_name)

# Rename the config file now so that AutoTokenizer can find it
os.rename(os.path.join(dir_name, "tokenizer_config.json"),
          os.path.join(dir_name, "config.json"))

tmp = AutoTokenizer.from_pretrained(dir_name)

I hope this helps!

Thanks!

user12769533