
So here's my question: I have a big 3-dimensional array, 100GB in size as a #zarr file (uncompressed, the array is more than twice that size). I have tried using the histogram from #Dask to calculate it, but I get an error saying that it can't do it because the file has tuples within tuples. I'm guessing that's the zarr file format rather than anything else?

Any thoughts?

edit: yes, the bigger-computer option wouldn't actually work...

I'm running a Dask client on a single machine; it runs the calculation but just gets stuck somewhere.
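For reference, this is roughly how the client is set up (a minimal sketch; the worker count and memory limit here are illustrative, not my exact settings):

from dask.distributed import Client

# illustrative single-machine setup; real worker count / memory limit may differ
client = Client(n_workers=4, threads_per_worker=1, memory_limit="4GB")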

I just tried mapping a function across the file with da.map_blocks, but when I plot the result I get something like this:

ValueError: setting an array element with a sequence.

here's a version of the script:

import dask.array as da

def histo(img):
    return da.histogram(img, bins=255, range=[0, 255])

# fimg is the dask array backed by the zarr file
histo_1 = da.map_blocks(histo, fimg)

I am actually going to try using it outside of the map function. I wonder whether the windowing from map_blocks is actually causing the issue. Well, I'll let you know whether it is or not...

edit 2

So I tried removing the map_blocks function as suggested, and this was my result:


[in] h, bins = da.histogram(fused_crop, bins=255, range=[0, 255])

[in] bins
[out] array([  0.,   1.,   2., ..., 253., 254., 255.])

[in] h.compute
[out] <bound method DaskMethodsMixin.compute of dask.array<sum-aggregate, shape=(255,), dtype=int64, chunksize=(255,), chunktype=numpy.ndarray>>

I'm going to try it in another notebook and see if it still occurs.

edit 3

It's the strangest thing, but if I just evaluate the variable h, it comes out as one small element from the dask array?

edit 4

Strange: if I call the xarray.hist or the da.hist function, they both fall over. If I use skimage.exposure.histogram it works, but it appears that the zarr file is unpacked into memory before the histogram is calculated, which is a bit of a problem...

Update 7th June 2020 (with a solution for not-big but annoyingly-medium data): see below for the answer.

2 Answers


You probably want to use Dask's histogram function directly rather than map_blocks. With map_blocks, Dask expects the output of each call to be the same size as the input block (or a shape derived from it), not the one-dimensional, fixed-size output of histogram.

h, bins = da.histogram(fused_crop, bins=255, range=[0, 255])
h.compute()
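For illustration, here is a self-contained sketch of the same pattern on a small random array (the array and its chunking are made up for the example; only the da.histogram call itself comes from the question):

import dask.array as da

# stand-in for the real zarr-backed data
x = da.random.randint(0, 256, size=(100, 100, 100), chunks=(50, 50, 50))

h, bins = da.histogram(x, bins=255, range=[0, 255])
counts = h.compute()  # h is lazy: compute() must be *called*, with parentheses
assert counts.sum() == x.size  # every element lands in exactly one bin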
mdurant
  • So, I thought the same thing and tried that again without the whole map_blocks, but h only points to one element. Then when I try compute I get the following output: `>` – Ouetis_Khan Jan 28 '20 at 18:06
  • You would do well to show exactly what code you ran – mdurant Jan 28 '20 at 18:13
  • sorry, adding code here isn't easy. I'll stick it up in the main comment – Ouetis_Khan Jan 28 '20 at 18:14
  • That is exactly what you should do – mdurant Jan 28 '20 at 18:18
  • What do you do when you find a method or function? Maybe call it! – mdurant Jan 28 '20 at 18:27
  • If I'm honest, I'm not the greatest of coders, so it's still a very new world for me. Any help in simplifying what I'm missing would help. I've decided to give up on the zarr format and just work with the original tiff file formats. Much larger on disk than zarr, but I'm more likely to be able to read the data, I think. – Ouetis_Khan Jan 29 '20 at 09:45
  • No, unfortunately, the whole thing just crashes, so I don't get any result out. I'm rewriting my whole pipeline to see where I may have gone wrong. – Ouetis_Khan Jan 30 '20 at 12:26
  • I also realised why my method was bound... I was pointing to a future result and hadn't actually called result! I did still have the problem of the whole thing crashing with map_blocks. I understand now, based on your advice, that mapping creates a huge overhead of additional results! – Ouetis_Khan Jun 09 '20 at 09:44

Update 7th June 2020 (with a solution for not-big but annoyingly-medium data):

So unfortunately I got a bit ill around this time and it took a while for me to feel better. Then the pandemic happened and I was on full childcare duty. I tried lots of different options, and ultimately it came down to the following:

1) If just using x.compute(), the memory would very quickly fill up.

2) Using distributed would fill the hard drive with spill-to-disk, take hours, and then hang and crash without producing anything, because (I'm guessing here, but based on the task graph and the Dask API) it would create a sub-histogram array for every chunk... and these would all need to be merged at some point.

3) The chunking of my data was suboptimal, so the number of tasks was massive, but even when I improved the chunking I couldn't compute a histogram (see the sketch below for a quick way to check the chunking and task count).
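As a quick sanity check before computing anything, the chunking and the resulting task count can be inspected like this (a sketch; the zarr path is a placeholder):

import dask.array as da

imgs = da.from_zarr("data.zarr")  # placeholder path
print(imgs.chunksize)    # shape of a single chunk, per axis
print(imgs.numblocks)    # number of chunks along each axis
print(imgs.npartitions)  # total number of chunks, a rough proxy for task count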

In the end I looked for a way of updating the histogram data incrementally. So I used Zarr to do it, by computing directly to it, since it allows concurrent reads and writes. As a reminder: my data is a zarr array in 3 dims (x, y, z), 300GB uncompressed and about 100GB compressed. On my 4-year-old laptop with 16GB of RAM, the following worked (I should have said my data was 16-bit unsigned):

imgs = da.from_zarr(.....)  # open the zarr store as a dask array

imgs2 = imgs.rechunk((a, b, c))  # individual chunk size per dimension

h, bins = da.histogram(imgs2, bins=255, range=[0, 65535])  # 255 bins over the 16-bit range

da.to_zarr(h, "histogram.zarr")  # compute and write the counts straight to a zarr store
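To use the counts afterwards, they can be read straight back from the store; a small sketch, assuming the write above has completed (the bin edges follow from bins=255 and range=[0, 65535]):

import zarr
import numpy as np

counts = zarr.open("histogram.zarr", mode="r")[:]  # the 255 bin counts as a numpy array
edges = np.linspace(0, 65535, 256)                 # 256 edges delimiting the 255 bins
print(counts.sum())  # total number of voxels counted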

I ran the progress bar alongside the process, and getting a histogram from file took:

[########################################] | 100% Completed | 18min 47.3s

Which I don't think is too bad for a 300GB array. Hopefully this helps someone else as well. Thanks for the help earlier in the year, @mdurant.