
I am searching for a thread-safe alternative to HDF5 to read from in a multiprocessing environment and stumbled across Zarr, which, according to benchmarks, is basically a drop-in replacement for h5py in a Python environment.

I tried it, and everything looks good so far, but I cannot wrap my head around the number of files Zarr outputs.

If I write to an HDF5 file with h5py, only one file results, whereas Zarr outputs a seemingly random number of files within a subfolder.
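For reference, here is a minimal comparison of what I mean (file names are just placeholders):

    import h5py
    import numpy as np
    import zarr

    data = np.random.rand(1000, 1000)

    # h5py: everything ends up in a single file
    with h5py.File('example.h5', 'w') as f:
        f.create_dataset('data', data=data)

    # zarr: the same write produces a folder containing many files
    z = zarr.open('example.zarr', mode='w', shape=data.shape,
                  chunks=(100, 100), dtype=data.dtype)
    z[:] = data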

Could someone explain why that is and what the exact number of created files depends on?

thanks in advance

CD86

1 Answer


Zarr generally maps keys (particular chunk indices) to values (binary blobs holding that chunk's data). If you are using the DirectoryStore, each of those key/value pairs becomes a separate file on disk. The number of files you see therefore depends on how many chunks your arrays have and which of them contain non-trivial content (e.g. non-zero values).
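For example, here is a sketch assuming Zarr v2's DirectoryStore (the array shape and paths are arbitrary); only the chunks that are actually written to receive a file, plus one metadata file:

    import os

    import numpy as np
    import zarr

    # a 1000x1000 array in 100x100 chunks -> a 10x10 grid of 100 possible chunks
    store = zarr.DirectoryStore('chunks_demo.zarr')
    z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='f8',
                   store=store, overwrite=True)

    # touch only the top-left quarter -> 5x5 = 25 chunks receive data
    z[:500, :500] = np.random.rand(500, 500)

    # one file per written chunk (keys like '0.0'), plus the .zarray metadata file
    print(len(os.listdir('chunks_demo.zarr')))  # expect 26 entries

Doubling the chunk size in each dimension cuts the number of chunk files by roughly a factor of four, so the chunk shape is the main knob controlling the file count.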

jakirkham
  • thank you for your reply! So this is basically the default behaviour and I have to deal with it? – CD86 Apr 18 '19 at 13:52
  • It's a consequence of how Zarr works under the hood. There are other stores that can be used if one wants a single file, like LMDB (see the sketch below). A list of all of them is here: https://zarr.readthedocs.io/en/stable/api/storage.html – jakirkham Apr 19 '19 at 03:56
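As a sketch of the single-file route, here is ZipStore, one of the stores on that list (LMDBStore, mentioned in the comment, works similarly but needs the lmdb package; names and paths here are arbitrary):

    import zarr

    # ZipStore keeps the whole hierarchy inside a single .zip file on disk
    store = zarr.ZipStore('single_file.zip', mode='w')
    root = zarr.group(store=store)
    arr = root.zeros('data', shape=(1000, 1000), chunks=(100, 100), dtype='f8')
    arr[:] = 1.0
    store.close()  # important: flushes the zip so it can be read back later

    # read it back
    store = zarr.ZipStore('single_file.zip', mode='r')
    print(zarr.open_group(store=store, mode='r')['data'][0, 0])
    store.close()

Note that ZipStore is best suited to write-once, read-many use; replacing existing chunks inside a zip archive is not well supported.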