
I have saved many data files as .npz to save storage space (using savez_compressed). Each file contains a single array, so when loading it with the numpy load function, it returns a dictionary-like object keyed by the array's name.

How can I quickly get the array itself instead of going through the dictionary?

For example:

import numpy as np

data = []
datum = np.load('file.npz')
key = list(datum.keys())[0]  # keys() is not indexable in Python 3
data.append([datum[key]])

When profiling this, my code spent most of the time using the get method for the dictionary.

If it was saved as a .npy file instead, it doesn't need the get method and is much faster:

data = []
data.append([np.load('file.npy')])

I thought that by loading the file, the data would already be in memory in both cases. savez_compressed doesn't seem to have an option to save just a plain array. Is this possible, or is there another way to speed up the loading?
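The trade-off described above can be sketched directly: save the same array both as a compressed .npz archive and as a plain .npy file, and compare the two load paths. The file names and temporary directory here are illustrative, not from the original post.

```python
import os
import tempfile

import numpy as np

arr = np.arange(1000, dtype=np.float64)

tmpdir = tempfile.mkdtemp()
npz_path = os.path.join(tmpdir, 'file.npz')
npy_path = os.path.join(tmpdir, 'file.npy')

# Compressed archive: np.load returns a lazy, dict-like NpzFile.
np.savez_compressed(npz_path, arr)
with np.load(npz_path) as datum:
    key = list(datum.keys())[0]  # positional arrays are named 'arr_0', 'arr_1', ...
    from_npz = datum[key]        # decompression happens here, not at np.load

# Plain .npy: np.load returns the ndarray directly, no key lookup needed.
np.save(npy_path, arr)
from_npy = np.load(npy_path)

assert np.array_equal(from_npz, from_npy)
```

This trades disk space for load speed: .npy skips both the zip-archive indirection and the decompression step.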

DV82XL
user-2147482637

1 Answer


np.load uses the np.lib.npyio.NpzFile class to load .npz files. Its docstring says:

NpzFile(fid)

A dictionary-like object with lazy-loading of files in the zipped
archive provided on construction.

`NpzFile` is used to load files in the NumPy ``.npz`` data archive
format. It assumes that files in the archive have a ".npy" extension,
other files are ignored.

The arrays and file strings are lazily loaded on either
getitem access using ``obj['key']`` or attribute lookup using
``obj.f.key``. A list of all files (without ".npy" extensions) can
be obtained with ``obj.files`` and the ZipFile object itself using
``obj.zip``.

I think the last paragraph answers your timing question. The data is not loaded until you do the dictionary get. So it isn't just an in-memory dictionary lookup - it's a file read (plus decompression).
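This lazy behavior is easy to observe: np.load on an .npz returns an NpzFile wrapper immediately, and the array bytes are only read and decompressed at `__getitem__` time. A minimal sketch, using an in-memory buffer rather than a file on disk:

```python
import io

import numpy as np

# Write a compressed archive into an in-memory buffer.
buf = io.BytesIO()
np.savez_compressed(buf, big=np.zeros(100_000))
buf.seek(0)

datum = np.load(buf)
# At this point only the zip directory has been opened;
# no array data has been read or decompressed yet.
print(type(datum).__name__)  # NpzFile

arr = datum['big']  # the actual read + decompress happens here
print(arr.shape)
```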

Python dictionary lookups themselves are fast - the interpreter does them all the time, both when accessing attributes of objects and when simply managing the namespace.
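To put a rough number on that claim, a quick timeit sketch (timings are machine-dependent, so no exact figure is promised):

```python
import timeit

d = {'key': 123}

# One million plain dictionary lookups.
t = timeit.timeit("d['key']", globals={'d': d}, number=1_000_000)
print(f"1M dict lookups: {t:.3f}s")  # typically a small fraction of a second
```

So the cost the profiler attributed to the "get" was almost entirely the file read and decompression, not the dictionary machinery.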

hpaulj