The following code writes a NumPy array plus metadata to a compressed .npz file and loads it back (compression is useless here because the data is random, but that doesn't matter):

import numpy as np

# save
D = {"x": np.random.random((10000, 1000)), "metadata": {"date": "20221123", "user": "bob", "name": "abc"}}
with open("test.npz", "wb") as f:
    np.savez_compressed(f, **D)

# load
D2 = np.load("test.npz", allow_pickle=True)
print(D2["x"])
print(D2["metadata"].item()["date"])

Let's say we want to change only one metadata field:

D["metadata"]["name"] = "xyz"

Is there a way to rewrite only D["metadata"] to disk in test.npz, rather than the whole file, given that D["x"] has not changed?

In my case, the .npz file can range from 100 MB to 4 GB, which is why rewriting only the metadata would be useful.

Basj
  • It should be possible. That npz file would be an archive with two files inside: `x.npy` and `metadata.npy`. With python's `zipfile` builtin, maybe we can open the archive's specific subfile and modify it somehow. – Mercury Nov 23 '22 at 11:39
  • Interesting solution @Mercury. Do you think there is high-level API to do this, or should we do this manually with `zipfile`? – Basj Nov 23 '22 at 11:50
  • The problem you have is very intuitive and `np.savez` and `np.load` do extensively use `zipfile` already, so ideally we *shouldn't* need the lower level library. In fact, the object you have after loading, `D2`, is an `NpzFile` object. While there is no direct page on this on the numpy docs, I can see from [here](https://stackoverflow.com/questions/34119752/querying-a-numpy-array-of-numpy-arrays-saved-as-an-npz-is-slow/34119852#34119852) + `help(NpzFile)` that `np.load` is lazy and doesn't actually load everything in memory. This makes a high level solution possible; let me run a few checks. – Mercury Nov 23 '22 at 12:28
  • Do a simple `np.save('metadata.npy', {"date": "20221123", "user": "bob", "name": "abc"}, allow_pickle=True)`, and then try to open that file with your favorite text editor. You'll see that finding the 'abc' string is not trivial, much less changing it to 'xyz'. – hpaulj Nov 23 '22 at 17:00
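
The archive layout described in the comments above can be checked directly with `zipfile`. The following is a minimal sketch, not part of the original discussion: the helper name `list_npz_members` and the small demo file are assumptions made for illustration. It lists each member of an .npz file with its compressed and uncompressed size.

```python
import os
import tempfile
import zipfile

import numpy as np


def list_npz_members(path):
    """Return (member name, compressed size, uncompressed size) per entry."""
    with zipfile.ZipFile(path) as zf:
        return [(i.filename, i.compress_size, i.file_size) for i in zf.infolist()]


# Small demo archive in a temporary directory (hypothetical data)
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.npz")
np.savez_compressed(demo_path, x=np.zeros((100, 10)), metadata={"name": "abc"})

members = list_npz_members(demo_path)
for name, packed, raw in members:
    print(name, packed, raw)
```

Each keyword passed to `np.savez_compressed` becomes one `<key>.npy` member, which is what makes replacing a single member conceivable in the first place.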

1 Answer

Ultimately, the only solution I have gotten to work so far is the one I originally thought of, using zipfile:

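import os
import zipfile
from contextlib import contextmanager

import numpy as np

@contextmanager
def archive_manager(archive_name: str, key: str):
    # Temporary .npy file in the current directory; also the member name in the archive
    s = f"{key}.npy"
    try:
        yield s
        # Append the new member. The old entry with the same name stays in the
        # archive (zipfile warns about a duplicate name); readers pick up the newest one.
        with zipfile.ZipFile(archive_name, "a") as f:
            f.write(s)
    finally:
        # Clean up the temporary file even if the body raised
        if os.path.exists(s):
            os.remove(s)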

Let's say we want to change metadata:

new_metadata = {"date": "20221123", "user": "bob", "name": "xyz"}

with archive_manager("test.npz", "metadata") as archive:
    np.save(archive, new_metadata)

np.load returns an NpzFile, which is a lazy loader. However, NpzFile objects aren't directly writable. We also cannot do something like D2["metadata"] = new_metadata until D2 has been converted to a dict, and that loses the lazy functionality.
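
The temporary file on disk is not strictly required. As a hedged sketch of the same idea, the new member can be serialized into an `io.BytesIO` buffer and appended with `ZipFile.writestr`. The function name `update_npz_member` and the small demo array are assumptions for illustration, not part of the answer above.

```python
import io
import os
import tempfile
import warnings
import zipfile

import numpy as np


def update_npz_member(archive_name, key, value):
    """Append a new `<key>.npy` member to an existing .npz, without a temp file."""
    buf = io.BytesIO()
    np.save(buf, value, allow_pickle=True)  # serialize to memory
    with warnings.catch_warnings():
        # zipfile warns about the duplicate member name; that is expected here
        warnings.simplefilter("ignore", UserWarning)
        with zipfile.ZipFile(archive_name, "a", compression=zipfile.ZIP_DEFLATED) as zf:
            zf.writestr(f"{key}.npy", buf.getvalue())


# Demo on a small archive in a temporary directory
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "test.npz")
x = np.random.random((100, 10))
np.savez_compressed(path, x=x, metadata={"date": "20221123", "user": "bob", "name": "abc"})

update_npz_member(path, "metadata", {"date": "20221123", "user": "bob", "name": "xyz"})

with np.load(path, allow_pickle=True) as data:
    new_name = data["metadata"].item()["name"]
    x_loaded = data["x"]
print(new_name)
```

Only the small metadata member is written; the bytes of `x.npy` already in the archive are not rewritten. As with the file-based version, the previous `metadata.npy` entry remains in the archive.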

Mercury
  • Thanks @Mercury. Just to be sure, why is there an `os.remove(...)`? Does it involve creating and deleting a temp file? If so, could we use `io.BytesIO` or `io.StringIO` or https://docs.python.org/3/library/tempfile.html? – Basj Nov 23 '22 at 14:50
  • PS: Isn't there a method in zipfile to edit an existing file inside a zip? – Basj Nov 23 '22 at 14:54
  • If you open a file in an update mode such as `r+b`, you can overwrite existing bytes by using `seek` (in plain `append` mode, writes typically go to the end of the file regardless of `seek`). But an `.npy` file is binary (not text). As your use of `allow_pickle` and `item` show, your `metadata` `dict` has been "pickled" (converted to a binary string), and saved as an element of an object dtype array. That's way too many layers to go through to "edit in-place". Mercury's idea of writing a new `npy` file to the archive is the best we can do. – hpaulj Nov 23 '22 at 16:54
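
One consequence of appending a new member is worth checking: the old entry is not removed, so the archive holds two entries with the same name and grows slightly with each update. The sketch below illustrates this with plain `zipfile`, using short byte strings as stand-ins for real `.npy` payloads; the helper `count_members` is a hypothetical name.

```python
import os
import tempfile
import warnings
import zipfile


def count_members(path, member):
    """Count how many entries in the ZIP archive share the given name."""
    with zipfile.ZipFile(path) as zf:
        return sum(1 for info in zf.infolist() if info.filename == member)


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.zip")

# Initial archive with one member (plain bytes stand in for a .npy payload)
with zipfile.ZipFile(path, "w") as zf:
    zf.writestr("metadata.npy", b"version 1")

# "Update" by appending a member with the same name
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)  # silence the duplicate-name warning
    with zipfile.ZipFile(path, "a") as zf:
        zf.writestr("metadata.npy", b"version 2")

duplicates = count_members(path, "metadata.npy")
with zipfile.ZipFile(path) as zf:
    latest = zf.read("metadata.npy")  # lookup by name resolves to the last entry
print(duplicates, latest)
```

For small metadata this overhead is negligible next to a multi-gigabyte array, but if updates are frequent, occasionally rewriting the whole archive would be the way to reclaim the space.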