Not sure if this question makes sense or is relevant to zarr. I'm storing zarr data on disk in groups, so for example I have:

import zarr

group = zarr.group()
d1 = group.create_dataset('baz', shape=100, chunks=10)
d2 = group.create_dataset('foo', shape=100, chunks=10)

Now group is iterable, so I can iterate over it and read the data from all of its arrays:

all_data = [group[g][:] for g in group]

Is there a way to read all of the data from the arrays using multithreading to speed it up? I know that within a single array you can use multithreading to read and write data.

Assuming that reading the data array by array is too slow for me, should I put all of the arrays into one big array container? I guess I'm wondering what the function of groups is, aside from being an organizational container, because assuming that each array contains similar data, you could theoretically just add another axis to your numpy array (one for the group members) and store everything in one big array.

Michael

1 Answer

Groups are primarily intended as an organisational container. A group can contain any number of arrays, where each array may have a different shape and/or data type, so they are a flexible way to organise data. If your arrays are all of the same shape and data type then you could, as you suggest, stack them all up into a single multidimensional array. However, I would not expect the read speed to be very different, whether you have multiple arrays in a group or have all data in a single array, if the total amount of data is the same.
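
For illustration, a minimal sketch of the stacking idea might look like the following (assuming, as in your example, that every member array has shape 100 and the same dtype; the name stacked is just for illustration):

import zarr

group = zarr.group()
group.create_dataset('baz', shape=100, chunks=10)
group.create_dataset('foo', shape=100, chunks=10)

# Single array with one extra leading axis, one row per member array.
names = sorted(group.array_keys())
stacked = zarr.zeros(shape=(len(names), 100), chunks=(1, 10))
for i, name in enumerate(names):
    stacked[i] = group[name][:]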

If you want to read all arrays in a group into memory, and you are using the default compressor (Blosc), then this will already use multiple threads during decompression. Blosc usually does a good job of making use of available cores, so you may not be able to improve much if at all by adding any further multithreading.
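
If you do want to experiment with reading the member arrays concurrently anyway, a rough sketch using the standard library's ThreadPoolExecutor could look like this (the worker count of 4 is arbitrary), although for the reason above it may not buy you much:

from concurrent.futures import ThreadPoolExecutor

def read_array(name):
    # Read one member array fully into memory as a numpy array.
    return group[name][:]

# Read every array in the group using a small pool of threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    all_data = list(pool.map(read_array, group.array_keys()))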

Alistair Miles