59

The default cache directory is running out of disk space, so I need to change the configuration of the default cache directory.

Ivan Lee

4 Answers

86

You can specify the cache directory every time you load a model with .from_pretrained by setting the parameter cache_dir. You can define a default location by exporting an environment variable TRANSFORMERS_CACHE before you use the library (i.e. before importing it!).

Example for Python:

import os
os.environ['TRANSFORMERS_CACHE'] = '/blabla/cache/'

Example for bash:

export TRANSFORMERS_CACHE=/blabla/cache/
cronoik
  • The "before importing the module" saved me for a related problem using flair, prompting me to import flair after changing the huggingface cache env variable. – Orysza Mar 23 '21 at 13:54
  • In addition, the environment variable for the datasets cache is `HF_HOME`. https://github.com/huggingface/transformers/issues/8703 – ezChx Sep 24 '21 at 18:02
  • First run the command below in a Linux terminal, then run the first command in the Python code. – Ahwar Nov 03 '22 at 17:26
  • @Ahwar you don't need both, one of them is enough. – cronoik Mar 08 '23 at 22:10
46

As @cronoik mentioned, as an alternative to modifying the cache path in the terminal, you can modify the cache directory directly in your code. I will just provide you with the actual code in case you have any difficulty looking it up on Hugging Face:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base", cache_dir="new_cache_dir/")

model = AutoModelForMaskedLM.from_pretrained("roberta-base", cache_dir="new_cache_dir/")
aysljc
26

You'll probably want to set the HF_HOME environment variable:

export HF_HOME=/path/to/cache/directory

This is because, besides the model cache of HF Transformers itself, other Hugging Face libraries have cache directories that also eat space in the home directory. The previous answers and comments did not make this clear.

In addition, it may make sense to set a symlink to catch cases where the environment variable is not set (you may have to move the directory ~/.cache/huggingface out of the way first, if it exists):

ln -s /path/to/cache/directory ~/.cache/huggingface

In particular, the HF_HOME environment variable is also respected by the Hugging Face datasets library, although the documentation does not explicitly state this.

The Transformers documentation describes how the default cache directory is determined:

Cache setup

Pretrained models are downloaded and locally cached at: ~/.cache/huggingface/hub. This is the default directory given by the shell environment variable TRANSFORMERS_CACHE. On Windows, the default directory is given by C:\Users\username\.cache\huggingface\hub. You can change the shell environment variables shown below - in order of priority - to specify a different cache directory:

  1. Shell environment variable (default): HUGGINGFACE_HUB_CACHE or TRANSFORMERS_CACHE.
  2. Shell environment variable: HF_HOME.
  3. Shell environment variable: XDG_CACHE_HOME + /huggingface.

What this piece of documentation doesn't explicitly mention is that HF_HOME defaults to $XDG_CACHE_HOME/huggingface and is used for other Hugging Face caches, e.g. the datasets cache, which is separate from the transformers cache. The value of XDG_CACHE_HOME is machine-dependent, but usually it is ~/.cache (and HF defaults to this value if XDG_CACHE_HOME is not set) - thus the usual default of ~/.cache/huggingface.
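
If it helps to see that lookup order in one place, here is a small Python sketch of the documented priority. This is an illustration only, not the library's actual resolution code, and the function name is made up:

import os

# Illustration of the documented lookup order, not the library's own code.
def guess_default_hf_hub_cache():
    # 1. TRANSFORMERS_CACHE / HUGGINGFACE_HUB_CACHE take priority if set.
    for var in ("TRANSFORMERS_CACHE", "HUGGINGFACE_HUB_CACHE"):
        value = os.environ.get(var)
        if value:
            return value
    # 2. Otherwise HF_HOME/hub.
    hf_home = os.environ.get("HF_HOME")
    if hf_home:
        return os.path.join(hf_home, "hub")
    # 3. Otherwise XDG_CACHE_HOME/huggingface/hub, with XDG_CACHE_HOME falling
    #    back to ~/.cache -- hence the usual default ~/.cache/huggingface/hub.
    xdg_cache = os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
    return os.path.join(xdg_cache, "huggingface", "hub")

print(guess_default_hf_hub_cache())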

Bernhard Stadler
  • This worked for me! After setting only TRANSFORMERS_CACHE, the cache was still saved in the home dir, but after the three settings it worked! – Rohola Zandie Sep 12 '22 at 23:30
  • TRANSFORMERS_CACHE only controls the Hugging Face Transformers cache, i.e. for model checkpoints. Other HF libraries like [datasets](https://github.com/huggingface/datasets/blob/main/src/datasets/config.py#L147), [evaluate](https://github.com/huggingface/evaluate/blob/main/src/evaluate/config.py#L115), [hub](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/constants.py#L73) and [autotrain](https://github.com/huggingface/autotrain-advanced/blob/main/src/autotrain/dataset.py#L141) have cache directories that are influenced by HF_HOME, but not by TRANSFORMERS_CACHE. – Bernhard Stadler Apr 19 '23 at 17:32
  • I linked to the main branch in my previous comment, so the content changes. For future reference: [datasets](https://github.com/huggingface/datasets/blob/f423b6/src/datasets/config.py#L147), [evaluate](https://github.com/huggingface/evaluate/blob/d48669/src/evaluate/config.py#L115), [hub](https://github.com/huggingface/huggingface_hub/blob/25e20e/src/huggingface_hub/constants.py#L73) and [autotrain](https://github.com/huggingface/autotrain-advanced/blob/0ff192/src/autotrain/dataset.py#L142) – Bernhard Stadler May 25 '23 at 06:29
3

Typically, you want to keep the datasets and model caches around for longer, but not other things. Also, these caches are large and you may not want them in your home folder.

So, let's say you create a directory /my_drive/hf where you want Hugging Face to cache everything. You can set the following environment variables:

export HF_HOME=/my_drive/hf/misc
export HF_DATASETS_CACHE=/my_drive/hf/datasets
export TRANSFORMERS_CACHE=/my_drive/hf/models

Now you can clean out non-essential things more easily.

Note that HF_HOME is basically the cache location for everything related to the Hub, but above you separate out the datasets and models caches. XDG_CACHE_HOME is not used if HF_HOME is set; if HF_HOME weren't set as above, it would default to $XDG_CACHE_HOME/huggingface.
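
To check that the libraries actually picked up these locations, you can print the directories they resolve at import time (in a fresh interpreter started after exporting the variables). This is only a quick sanity check; the attribute names below are internal and may move between releases, so treat them as an assumption:

import datasets.config
import transformers.utils

# Both should point at the directories exported above (e.g. under /my_drive/hf/),
# not the default ~/.cache/huggingface locations.
print(datasets.config.HF_DATASETS_CACHE)
print(transformers.utils.TRANSFORMERS_CACHE)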

More info: https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables

Shital Shah