t is a dask array, and I'd like to plot a histogram of it. The Dask documentation lists the function

dask.array.histogram(a, bins=None, range=None, normed=False, weights=None, density=None)

but gives no example. I tried setting bins to a NumPy array, which didn't work. I also tried matplotlib.pyplot, but it ran for more than 5 minutes without producing anything. My dataset is quite large (GB-sized), but that still seems like a really long time.

Sander van den Oord
tnabdb

2 Answers

dask.array.histogram requires both bins and range to be set: bins to the desired number of bins, and range to the min/max of the data. Here is a quick example:

In [1]: import dask.array as da

In [2]: x = da.random.normal(10, 0.1, size=(100000,), chunks=(1000,))  # random dataset 

In [3]: h, bins = da.histogram(x, bins=100, range=[9, 11])

In [4]: bins
Out[4]: 
array([  9.  ,   9.02,   9.04,   9.06,   9.08,   9.1 ,   9.12,   9.14,
         9.16,   9.18,   9.2 ,   9.22,   9.24,   9.26,   9.28,   9.3 ,
         9.32,   9.34,   9.36,   9.38,   9.4 ,   9.42,   9.44,   9.46,
         9.48,   9.5 ,   9.52,   9.54,   9.56,   9.58,   9.6 ,   9.62,
         9.64,   9.66,   9.68,   9.7 ,   9.72,   9.74,   9.76,   9.78,
         9.8 ,   9.82,   9.84,   9.86,   9.88,   9.9 ,   9.92,   9.94,
         9.96,   9.98,  10.  ,  10.02,  10.04,  10.06,  10.08,  10.1 ,
        10.12,  10.14,  10.16,  10.18,  10.2 ,  10.22,  10.24,  10.26,
        10.28,  10.3 ,  10.32,  10.34,  10.36,  10.38,  10.4 ,  10.42,
        10.44,  10.46,  10.48,  10.5 ,  10.52,  10.54,  10.56,  10.58,
        10.6 ,  10.62,  10.64,  10.66,  10.68,  10.7 ,  10.72,  10.74,
        10.76,  10.78,  10.8 ,  10.82,  10.84,  10.86,  10.88,  10.9 ,
        10.92,  10.94,  10.96,  10.98,  11.  ])

In [5]: h.compute()
Out[5]: 
array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    1,    1,    4,   15,
         19,   71,  132,  231,  376,  604,  891, 1307, 1884, 2635, 3422,
       4276, 5455, 6158, 7092, 7759, 7933, 7994, 7625, 6994, 6194, 5315,
       4272, 3381, 2529, 1803, 1324,  912,  594,  331,  225,  127,   54,
         32,   12,   10,    2,    2,    1,    1,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0])
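
To get from these counts to an actual plot, something like the following may work. It is only a sketch: the array t here is random stand-in data, the min/max are computed from the data on the assumption that the range is not known beforehand, and the Agg backend and the output filename histogram.png are choices made for illustration. Only the 100 counts and 101 bin edges are brought into memory, so matplotlib never has to see the full dataset.

```python
import numpy as np
import dask
import dask.array as da
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no display is needed
import matplotlib.pyplot as plt

# Stand-in for the real array t (assumption: a 1-D numeric dask array)
t = da.random.normal(10, 0.1, size=(100000,), chunks=(1000,))

# If the range is not known in advance, compute min and max together
lo, hi = dask.compute(t.min(), t.max())

# Build the lazy histogram, then compute only the small array of counts
h, bins = da.histogram(t, bins=100, range=[lo, hi])
counts = h.compute()
edges = np.asarray(bins)  # handles bins as either a NumPy or a dask array

# Draw the precomputed counts as bars aligned to the left bin edges
fig, ax = plt.subplots()
ax.bar(edges[:-1], counts, width=np.diff(edges), align="edge")
ax.set_xlabel("value")
ax.set_ylabel("count")
fig.savefig("histogram.png")  # hypothetical output filename
```

Calling plt.hist on the dask array directly would likely force the whole array into memory first, which may be why the matplotlib attempt in the question was so slow.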
MRocklin

The hvplot library can draw a histogram from a Dask DataFrame. Here is an example.

The following is pseudocode: dd is a Dask DataFrame, and the histogram is plotted for the column named feature_one.

import hvplot.dask  # importing this adds the .hvplot accessor to Dask objects

dd.hvplot.hist(y="feature_one")

Installing the library with conda is recommended:

conda install -c conda-forge hvplot
Arnab Biswas