How to download the pretrained dataset of huggingface RagRetriever to a custom directory

Question

I'm playing with a RAG example from Facebook (Hugging Face): https://huggingface.co/facebook/rag-token-nq#usage.

Here is a very nice explanation of it: https://ai.facebook.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/

The code is very simple, but the dataset it downloads in this step is rather large (75 GB):

retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)

It downloads the dataset to /root/.cache/huggingface/datasets/, which I'd like to change if possible. This is the output of that line of code:

Downloading and preparing dataset wiki_dpr/psgs_w100.nq.no_index (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/wiki_dpr/

My question is: how can I make RagRetriever.from_pretrained download its dataset (the 75 GB one) to a folder other than /root/.cache/huggingface/datasets/?

Thanks!

Unfortunately, it is not working; it continues downloading the dataset to /root/.cache. It doesn't matter whether I set os.environ['TRANSFORMERS_CACHE'] or pass cache_dir as an argument to from_pretrained. The strange thing is that it works fine with the RagTokenizer but not with the RagRetriever, which seems to be a bug. Thanks for the lead anyway; it was very informative. - JoseM LM, Oct 05 '20 at 14:10
A workaround for this (until it is fixed) is to use a symlink. - JoseM LM, Oct 05 '20 at 15:30
They are already working on this [issue](https://github.com/huggingface/transformers/issues/7583). It should be fixed in the next version. - cronoik, Oct 06 '20 at 04:16
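
The symlink workaround mentioned in the comments could look roughly like the sketch below. The helper name redirect_cache_dir is hypothetical (it is not part of transformers or datasets), and the sketch assumes a POSIX-like system, write access to both locations, and that the target folder does not already contain entries with the same names as the ones being moved.

```python
import os
import shutil
from pathlib import Path


def redirect_cache_dir(default_dir, target_dir):
    """Replace default_dir with a symlink that points at target_dir.

    Hypothetical helper for illustration; not a transformers/datasets API.
    """
    default_dir = Path(default_dir).expanduser()
    target_dir = Path(target_dir).expanduser()

    # Make sure the real storage location exists.
    target_dir.mkdir(parents=True, exist_ok=True)

    if default_dir.is_symlink():
        # Already redirected earlier: drop the old link and recreate it.
        default_dir.unlink()
    elif default_dir.exists():
        # Move anything already downloaded so it is not fetched again.
        for entry in default_dir.iterdir():
            shutil.move(str(entry), str(target_dir / entry.name))
        default_dir.rmdir()

    # Create the parent of the link if needed, then the link itself.
    default_dir.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(target_dir, default_dir, target_is_directory=True)
    return default_dir
```

Calling redirect_cache_dir("/root/.cache/huggingface/datasets", "/mnt/data/hf_datasets") before RagRetriever.from_pretrained would make the library write through the link into the second folder; /mnt/data/hf_datasets is only a placeholder for whatever directory has enough free space.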
