
I am searching for a thread-safe alternative to HDF5 to read from in a multiprocessing environment and stumbled across Zarr, which, according to benchmarks, is basically a drop-in replacement for h5py in a Python environment.

I tried it, and everything looks good so far, but I cannot wrap my head around the number of files Zarr outputs.

If I write to an HDF5 file with h5py, only one file results, whereas Zarr outputs a seemingly random number of files within a subfolder.
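For reference, here is a minimal comparison of what I mean (file names are just placeholders):

    import h5py
    import numpy as np
    import zarr

    data = np.random.rand(1000, 1000)

    # h5py: everything ends up in a single file
    with h5py.File('example.h5', 'w') as f:
        f.create_dataset('data', data=data)

    # zarr: the same write produces a folder containing many files
    z = zarr.open('example.zarr', mode='w', shape=data.shape,
                  chunks=(100, 100), dtype=data.dtype)
    z[:] = data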

Could someone explain why that is and what the exact number of created files depends on?

thanks in advance

CD86

1 Answer


Zarr generally maps keys (particular chunk indices) to values (binary blobs holding that chunk's data). If you are using the DirectoryStore, each of those key/value pairs becomes a separate file on disk. The number of files you see therefore depends on how many chunks your arrays have and which of them contain non-trivial content (e.g. non-zero values).
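For example, here is a sketch assuming Zarr v2's DirectoryStore (the array shape and paths are arbitrary); only the chunks that are actually written to receive a file, plus one metadata file:

    import os

    import numpy as np
    import zarr

    # a 1000x1000 array in 100x100 chunks -> a 10x10 grid of 100 possible chunks
    store = zarr.DirectoryStore('chunks_demo.zarr')
    z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='f8',
                   store=store, overwrite=True)

    # touch only the top-left quarter -> 5x5 = 25 chunks receive data
    z[:500, :500] = np.random.rand(500, 500)

    # one file per written chunk (keys like '0.0'), plus the .zarray metadata file
    print(len(os.listdir('chunks_demo.zarr')))  # expect 26 entries

Doubling the chunk size in each dimension cuts the number of chunk files by roughly a factor of four, so the chunk shape is the main knob controlling the file count.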

jakirkham
  • thank you for your reply! So this is basically the default behaviour and I have to deal with it? – CD86 Apr 18 '19 at 13:52
  • It's a consequence of how Zarr works under the hood. There are other stores that can be used if one wants a single file, like LMDB (see the sketch below). A list of all of them is here: https://zarr.readthedocs.io/en/stable/api/storage.html – jakirkham Apr 19 '19 at 03:56
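As a sketch of the single-file route, here is ZipStore, one of the stores on that list (LMDBStore, mentioned in the comment, works similarly but needs the lmdb package; names and paths here are arbitrary):

    import zarr

    # ZipStore keeps the whole hierarchy inside a single .zip file on disk
    store = zarr.ZipStore('single_file.zip', mode='w')
    root = zarr.group(store=store)
    arr = root.zeros('data', shape=(1000, 1000), chunks=(100, 100), dtype='f8')
    arr[:] = 1.0
    store.close()  # important: flushes the zip so it can be read back later

    # read it back
    store = zarr.ZipStore('single_file.zip', mode='r')
    print(zarr.open_group(store=store, mode='r')['data'][0, 0])
    store.close()

Note that ZipStore is best suited to write-once, read-many use; replacing existing chunks inside a zip archive is not well supported.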