4

I have a zipfile which contains many npy files (file1.npy, file2.npy, file3.npy, ...). I would like to load them individually without extracting the zipfile on a filesystem. I have tried many things but I can't figure it out.

My guess was:

import zipfile
import numpy as np

a = {}

with zipfile.ZipFile('myfiles.zip') as zipper:
    for p in zipper.namelist():
        with zipper.read(p) as f:
            a[p] = np.load(f)

Any ideas?

Dharman
Sigmun
  • What is your error? Why isn't it working? – pppery May 02 '16 at 13:13
  • Instead of having a zip of many `*.npy`, you could use [savez_compressed](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.savez_compressed.html) to save them all into a single `*.npz`, which you then don't need to unzip manually. – kennytm May 02 '16 at 13:34
  • @kennytm I don't have access to the way the files are saved – Sigmun May 02 '16 at 15:08
  • I wonder if renaming the file to `*.npz` would fool `np.load` into treating it as a `savez` produced archive. Or use `np.lib.npyio.NpzFile` directly. – hpaulj May 02 '16 at 16:16
  • @hpaulj What I don't understand about your suggestion is that I have a zipfile that already contains many npy files... So how can I try your idea? Can you write a full answer? – Sigmun May 03 '16 at 06:52
  • I just tested `load` on a `zip` archive - it works even though I didn't use `np.savez`. – hpaulj May 03 '16 at 07:01
  • For those interested a nice read on the [npy format](https://docs.scipy.org/doc/numpy/neps/npy-format.html). Particularly: "For a simple way to combine multiple arrays into a single file, one can use ZipFile to contain multiple ”.npy” files. We recommend using the file extension ”.npz” for these archives." – AnnanFay Nov 26 '17 at 18:12

3 Answers

5

Save 2 arrays, each to their own file:

In [452]: np.save('x.npy',x)
In [453]: np.save('y.npy',y)

With a file browser tool, create a zip file, and try to load it:

In [454]: np.load('xy.zip')
Out[454]: <numpy.lib.npyio.NpzFile at 0xb48968ec>

Looks like np.load detected the zip nature (independent of the name), and returned a NpzFile object. Let's assign it to a variable, and try the normal .npz extract:

In [455]: xy=np.load('xy.zip')

In [456]: xy['x']
Out[456]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [457]: xy['y']
Out[457]: 
array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])

So load can perform the lazy load on any zip file of npy files, regardless of how it's created.
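
The same behaviour can be reproduced without a file browser. The following is a minimal sketch, assuming the archive is built with the standard zipfile module in a temporary directory (the paths and the names x.npy, y.npy and xy.zip are only illustrative). It relies on the NpzFile object exposing a files list and working as a context manager:

```python
import os
import tempfile
import zipfile

import numpy as np

x = np.arange(12).reshape(3, 4)
y = x.T

with tempfile.TemporaryDirectory() as tmpdir:
    # Save each array to its own .npy file, as in the session above.
    x_path = os.path.join(tmpdir, "x.npy")
    y_path = os.path.join(tmpdir, "y.npy")
    np.save(x_path, x)
    np.save(y_path, y)

    # Build a plain zip archive with zipfile instead of np.savez.
    zip_path = os.path.join(tmpdir, "xy.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.write(x_path, arcname="x.npy")
        zf.write(y_path, arcname="y.npy")

    # np.load recognises the zip signature and returns an NpzFile.
    with np.load(zip_path) as xy:
        names = list(xy.files)  # member names with the .npy suffix stripped
        loaded_x = xy["x"]      # each array is read only when indexed
        loaded_y = xy["y"]
```

Each member is read only when it is indexed, so an archive with many files does not have to be loaded in full.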

hpaulj
  • Very nice answer ! – Sigmun May 03 '16 at 09:08
  • For those interested a nice read on the [npy format](https://docs.scipy.org/doc/numpy/neps/npy-format.html). Particularly: "For a simple way to combine multiple arrays into a single file, one can use ZipFile to contain multiple ”.npy” files. We recommend using the file extension ”.npz” for these archives." – AnnanFay Nov 26 '17 at 18:09
1

np.load expects a file object (or a path), not the raw bytes that zipper.read(p) returns. For zip files, I generally do something like:

with ZipFile(path, mode='r') as archive:
    with io.BufferedReader(archive.open(filename, mode='r')) as file:
        ...  # use file like any other binary file object

I am guessing you should pass zipper.open(p, mode='r') to np.load. Also, I strongly urge you not to use zipper.read(p), since it reads the whole file into memory at once. So, using your code conventions, try:

import io

with zipfile.ZipFile('myfiles.zip') as zipper:
    for p in zipper.namelist():
        with io.BufferedReader(zipper.open(p, mode='r')) as f:
            a[p] = np.load(f)
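
Whether a zip member can be passed to np.load directly depends on whether the member's file object supports seek. Here is a rough sketch of one way to handle both cases, assuming a hypothetical helper load_member (not part of numpy or zipfile) and an archive built in memory purely for illustration. It tries the open member first and falls back to an in-memory copy if seeking is unsupported:

```python
import io
import zipfile

import numpy as np


def load_member(zipper, name):
    """Load one .npy member from an open ZipFile (hypothetical helper)."""
    try:
        # Pass the member's file object straight to np.load.
        with zipper.open(name, mode="r") as member:
            return np.load(member)
    except io.UnsupportedOperation:
        # The member cannot seek: read it fully into memory instead.
        return np.load(io.BytesIO(zipper.read(name)))


# Build a small archive in memory so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for name, arr in [("file1.npy", np.arange(3)), ("file2.npy", np.eye(2))]:
        member_bytes = io.BytesIO()
        np.save(member_bytes, arr)
        zf.writestr(name, member_bytes.getvalue())

buffer.seek(0)
with zipfile.ZipFile(buffer) as zipper:
    a = {p: load_member(zipper, p) for p in zipper.namelist()}
```

The fallback costs one full copy of the member in memory, which is the trade-off the paragraph above warns about.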
Sigmun
  • the `zipper.open(p,mode='r')` command gives me the following error : Traceback (most recent call last): File "", line 4, in File "/usr/local/python/lib/python2.7/site-packages/numpy/lib/npyio.py", line 379, in load fid.seek(-N, 1) # back-up io.UnsupportedOperation: seek – Sigmun May 02 '16 at 15:09
  • I edited your answer so that both examples now work – Sigmun May 02 '16 at 15:38
0

Instead of opening the member as a file, I wrap the bytes returned by read in io.BytesIO. I do not know if it is efficient, but it works and is more readable :)

import io
from zipfile import ZipFile

with ZipFile(fname) as z:
    for p in z.namelist():
        tmp = np.load(io.BytesIO(z.read(p)))  # array stored in member p
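
To keep every array rather than overwrite tmp on each pass, the loop can be wrapped in a function that returns a dict. This is only a sketch under stated assumptions: load_npy_zip is a hypothetical name, the .npy suffix filter is an added convenience, and the demo archive is created in a temporary directory:

```python
import io
import os
import tempfile
import zipfile

import numpy as np


def load_npy_zip(path):
    """Return {member name: array} for every .npy member of a zip file."""
    arrays = {}
    with zipfile.ZipFile(path) as z:
        for name in z.namelist():
            if not name.endswith(".npy"):
                continue  # skip directories and unrelated files
            # z.read returns the member's bytes; BytesIO makes them file-like.
            arrays[name] = np.load(io.BytesIO(z.read(name)))
    return arrays


with tempfile.TemporaryDirectory() as tmpdir:
    # Create a demo archive with two arrays and one unrelated file.
    zip_path = os.path.join(tmpdir, "myfiles.zip")
    with zipfile.ZipFile(zip_path, "w") as z:
        for name, arr in [("file1.npy", np.arange(4)), ("file2.npy", np.ones((2, 2)))]:
            raw = io.BytesIO()
            np.save(raw, arr)
            z.writestr(name, raw.getvalue())
        z.writestr("README.txt", "not an array")

    a = load_npy_zip(zip_path)
```

Each member is held in memory once as bytes and once as an array while it is being loaded, so this suits archives whose individual files fit comfortably in RAM.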
Theis Jendal