TL;DR for updated tiktoken & cl100k_base
Should work as of time of writing
- Download https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken on your local machine
- Rename it to 9b5ad71b2ce5302211f9c61530b329a4922fc6a4
- Transfer it to your remote machine, into a folder called "tiktoken_cache"
- Run the following code every time you need to use tiktoken:
import os
tiktoken_cache_dir = "path_to_tiktoken_cache_folder"
os.environ["TIKTOKEN_CACHE_DIR"] = tiktoken_cache_dir
# validate
assert os.path.exists(os.path.join(tiktoken_cache_dir,"9b5ad71b2ce5302211f9c61530b329a4922fc6a4"))
I just ran into this issue today as well. It was not the exact same error, but the solution for running tiktoken offline should be the same: we'll download the necessary file, then "trick" tiktoken into treating it as already cached.
This method works if, say, you have a remote machine with no internet access and a local machine with internet access.
I'm outlining a generalized version below, but you can use the TL;DR above if you have an up-to-date version of tiktoken and are using the cl100k_base tokenizer.
Generalized Steps
Step 1: Getting the blob URL
First, let's grab the tokenizer blob URL from the source on your remote machine. If we trace the get_encoding function, we find it calls a function from tiktoken_ext.openai_public, which has the blob URIs for each encoder. Identify the correct function, then print its source:
import tiktoken_ext.openai_public
import inspect
print(dir(tiktoken_ext.openai_public))
# The encoder we want is cl100k_base; it shows up as one of the listed functions
print(inspect.getsource(tiktoken_ext.openai_public.cl100k_base))
# The URL should be in the load_tiktoken_bpe() function call
As of the time of writing, it should be https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken for cl100k_base.
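If you would rather not read the printed source by eye, the URL can be pulled out with a regular expression. This is only a sketch: find_blob_urls is a hypothetical helper, and sample_source is a hand-written stand-in for what inspect.getsource() prints, so the real source text may be laid out differently.

```python
import re
from typing import List

def find_blob_urls(source_text: str) -> List[str]:
    """Return every quoted https:// or az:// string literal found in source_text."""
    pattern = r"""["']((?:https|az)://[^"']+)["']"""
    return re.findall(pattern, source_text)

# Hypothetical stand-in for the output of
# inspect.getsource(tiktoken_ext.openai_public.cl100k_base)
sample_source = '''
def cl100k_base():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
    )
'''

print(find_blob_urls(sample_source))
```

On the remote machine, you would pass the real inspect.getsource(...) result instead of sample_source.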
Step 2: Downloading
Now, navigate to the blob URL on your local machine to download it.
Note: for old versions of tiktoken, Step 1 would have yielded an Azure blob URI (like az://openaipublic/encodings/cl100k_base.tiktoken); if this is the case, head to the latest tiktoken source and grab a non-Azure link for downloading purposes only.
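As an alternative to looking up the newer source, an az:// URI can usually be rewritten into a downloadable https link. The sketch below assumes the usual public Azure Blob Storage host pattern (account name followed by .blob.core.windows.net); az_to_https is a hypothetical helper, not part of tiktoken, so check the result against the latest source if in doubt.

```python
def az_to_https(blob_uri: str) -> str:
    """Convert az://<account>/<container>/<path> into an https download URL.

    Assumption: the blob is public and served from the standard
    <account>.blob.core.windows.net host.
    """
    prefix = "az://"
    if not blob_uri.startswith(prefix):
        return blob_uri  # already a regular URL; leave it untouched
    account, _, rest = blob_uri[len(prefix):].partition("/")
    if not account or not rest:
        raise ValueError(f"unexpected az:// URI: {blob_uri!r}")
    return f"https://{account}.blob.core.windows.net/{rest}"

print(az_to_https("az://openaipublic/encodings/cl100k_base.tiktoken"))
# prints https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
```

Use the converted link only for downloading; the cache key in Step 3 is still computed from the original az:// string.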
Step 3: Copy and rename file
Now, transfer the file to a new folder on your remote machine. Tracing the get_encoding function further reveals a call to tiktoken.load.read_file_cached(), which indicates the file needs to be renamed. To get the name for the file, run the following code (pulled from source):
import hashlib
blobpath = "your_blob_url_here"
cache_key = hashlib.sha1(blobpath.encode()).hexdigest()
print(cache_key)
Note: blobpath is the blob URL/URI discovered in Step 1; if Step 1 gave you an az:// path, you still use that one here, not the https link you downloaded from.
Rename the file on the remote machine to the value of cache_key.
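The hashing and renaming in this step can be combined into one small helper. This is a minimal sketch rather than anything tiktoken provides: the function name install_into_cache and its parameters are hypothetical, and only the SHA-1 cache-key computation comes from the snippet above.

```python
import hashlib
import shutil
from pathlib import Path

def install_into_cache(downloaded_file, blobpath, cache_dir):
    """Copy downloaded_file into cache_dir under the SHA-1 name derived from blobpath.

    blobpath must be the exact string found in Step 1 (including az:// if
    that is what the installed tiktoken version uses).
    """
    cache_key = hashlib.sha1(blobpath.encode()).hexdigest()
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    target = cache_dir / cache_key
    shutil.copyfile(downloaded_file, target)  # the original download is left in place
    return target

# Example call (paths are placeholders):
# install_into_cache(
#     "cl100k_base.tiktoken",
#     "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken",
#     "tiktoken_cache",
# )
```

Copying instead of renaming keeps the original download around in case the cache key needs to be recomputed for a different tiktoken version.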
Step 4: Set up the tiktoken cache
The read_file_cached function then checks environment variables for a cache path and reads from there, so let's set that up:
import os
tiktoken_cache_dir = "path_to_folder_containing_tiktoken_file"
os.environ["TIKTOKEN_CACHE_DIR"] = tiktoken_cache_dir
# validate
assert os.path.exists(os.path.join(tiktoken_cache_dir, cache_key))
Note: this is not the full path to the tiktoken file, only the path to the folder containing the file
This code snippet will need to be run every time you need tiktoken.
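Since this setup has to run every time, it may be convenient to wrap it in a function that fails loudly when the cached file is missing. The following is a sketch under that assumption: enable_offline_cache is a hypothetical name, and only the TIKTOKEN_CACHE_DIR variable and the SHA-1 naming come from the steps above.

```python
import hashlib
import os

def enable_offline_cache(cache_dir, blobpath):
    """Point tiktoken at cache_dir after checking the expected file is there.

    Call this before tiktoken.get_encoding(); returns the cached file's path.
    """
    cache_key = hashlib.sha1(blobpath.encode()).hexdigest()
    cached_file = os.path.join(cache_dir, cache_key)
    if not os.path.isfile(cached_file):
        raise FileNotFoundError(
            f"expected cached tokenizer file at {cached_file}"
        )
    # Folder path only, not the path to the file itself
    os.environ["TIKTOKEN_CACHE_DIR"] = cache_dir
    return cached_file
```

Raising an exception instead of using assert keeps the check active even when Python runs with the -O flag, which strips assert statements.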
Step 5: Use tiktoken
Congrats, now you can use tiktoken:
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
encoding.encode("Hello, world")