What is the format of files in `pip-cache/pip/http`?

Question

I'm trying to restore some hard-to-get dependencies from the pip cache. All the files stored in the mentioned directory start with the bytes cc=2. I would expect something like .tar.gz files there, but nope.

I briefly searched through the pip source code, but mostly found code dealing with Wheel archives (which are essentially .zip files). I failed to find the code handling the /http/ subfolder. Not sure if I was looking at the wrong part of the source tree, or at the wrong pip version.

See https://stackoverflow.com/a/59240899/7976758 Found in https://stackoverflow.com/search?q=%5Bpip%5D+cache+files+format — phd, Jun 18 '20 at 12:02
@phd a great find .. if you bother writing an answer with https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/serialize.py#L156, I'll happily accept it;) — liborm, Jun 18 '20 at 12:41

score 1 · Accepted Answer · edited Jun 18 '20 at 14:25

pip uses cachecontrol to load/store its cache files. The filename is the sha224 of the URL being requested.
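
As a rough illustration of that naming scheme, the sketch below hashes a URL with SHA-224 from the standard library. The helper name `cache_filename` is hypothetical, and it assumes the cache key is exactly the URL string that was requested. The real cache may normalize the URL first or nest the file in subdirectories, so treat the result as a candidate name to search for, not as pip's exact logic.

```python
import hashlib


def cache_filename(url: str) -> str:
    """Return the SHA-224 hex digest of a URL (hypothetical helper)."""
    # Assumption: the key is the URL exactly as requested, encoded as UTF-8.
    return hashlib.sha224(url.encode("utf8")).hexdigest()


if __name__ == "__main__":
    # Print a candidate file name to look for under the cache directory.
    print(cache_filename("https://pypi.org/simple/requests/"))
```

Searching the cache directory for a file with that name (for example with `find`) is a quick way to check whether the assumption holds for a given URL.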

The code to load the files can be seen on GitHub.

For the particular case of cc=2, the data after the cc=2, prefix (note the comma) can be decoded with

cached = json.loads(zlib.decompress(data).decode("utf8"))

# We need to decode the items that we've base64 encoded
cached["response"]["body"] = _b64_decode_bytes(cached["response"]["body"])
cached["response"]["headers"] = dict(
    (_b64_decode_str(k), _b64_decode_str(v))
    for k, v in cached["response"]["headers"].items()
)
cached["response"]["reason"] = _b64_decode_str(cached["response"]["reason"])
cached["vary"] = dict(
    (_b64_decode_str(k), _b64_decode_str(v) if v is not None else v)
    for k, v in cached["vary"].items()
)
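
The excerpt above relies on helpers that are private to the library, so here is a minimal self-contained sketch of the same steps using only the standard library. The names `decode_cc2` and `extract_body` are hypothetical. It assumes the private helpers are plain base64 decoding (with UTF-8 for the string variant), and that the file starts with exactly `cc=2,`.

```python
import base64
import json
import zlib

PREFIX = b"cc=2,"  # version marker, including the trailing comma


def _b64_bytes(s: str) -> bytes:
    # Assumption: values are standard base64 stored as ASCII text.
    return base64.b64decode(s.encode("ascii"))


def _b64_str(s: str) -> str:
    # Assumption: decoded strings are UTF-8.
    return _b64_bytes(s).decode("utf8")


def decode_cc2(raw: bytes) -> dict:
    """Decode the raw bytes of one cc=2 cache file (hypothetical helper)."""
    if not raw.startswith(PREFIX):
        raise ValueError("not a cc=2 cache entry")
    # Everything after the prefix is zlib-compressed JSON.
    cached = json.loads(zlib.decompress(raw[len(PREFIX):]).decode("utf8"))

    response = cached["response"]
    response["body"] = _b64_bytes(response["body"])
    response["headers"] = {
        _b64_str(k): _b64_str(v) for k, v in response["headers"].items()
    }
    response["reason"] = _b64_str(response["reason"])
    cached["vary"] = {
        _b64_str(k): _b64_str(v) if v is not None else v
        for k, v in cached["vary"].items()
    }
    return cached


def extract_body(path: str) -> bytes:
    """Read one cache file and return the decoded response body."""
    with open(path, "rb") as f:
        return decode_cc2(f.read())["response"]["body"]
```

Calling `extract_body` on a cache file and writing the result to disk should give back the originally downloaded payload, such as a wheel or an sdist, provided the entry really is in the cc=2 layout.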

Just for my future reference: unwrap all data to `unpacked`: `find var/pip-cache/pip/http -type f | parallel "<{} tail -c+6 | pigz -z -d | jq -r .response.body | base64 -d > unpacked/{/}"`. — liborm, Jul 10 '20 at 19:24
