In order to get a better idea of the dask library in Python, I am trying to make a fair comparison between using dask and not using it. I used h5py to create a big dataset, which was later used to measure the mean along one of the axes as a numpy-style operation.

I was wondering whether what I did is actually a fair way to check if dask can run code in parallel. I read the documentation of both h5py and dask, and came up with this little experiment.

What I did so far was:

  1. Create (write) a dataset using h5py. This was done with the maxshape and resize approach to append data, so that the whole dataset is never loaded into memory at once and memory problems are avoided.

  2. Time a simple operation (the mean) along one axis using "classical" code, which means computing the mean every 1000 rows.

  3. Repeat the previous step, but using dask this time.

For the first step, this is what I have so far:

# Write h5 dataset
import time
import h5py
import numpy as np

chunks = (100, 500, 2)
tp = time.time()
with h5py.File('path/3D_matrix_1.hdf5', 'w') as f:
    # create a 3D dataset inside one h5py file;
    # maxshape=(None, 5000, 2) allows appending along axis 0 later
    dset = f.create_dataset('3D_matrix', (10000, 5000, 2), chunks=chunks,
                            maxshape=(None, 5000, 2), compression='gzip')
    print(dset.shape)
    while dset.shape[0] < 4*10**7:  # append data until axis 0 = 4*10**7
        dset.resize(dset.shape[0] + 10**4, axis=0)  # grow axis 0 by 10**4 rows
        print(dset.shape)  # check new shape after each append
        dset[-10**4:] = np.random.randint(2, size=(10**4, 5000, 2))  # fill only the new rows
    tmp = time.time() - tp
    print('Writing time: {}'.format(tmp))
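
To double-check the write step, this small snippet (a sketch, reusing the same path and dataset name as above) reads back the shape, chunk layout and compression:

# Sanity check of the written file (sketch)
import h5py
with h5py.File('path/3D_matrix_1.hdf5', 'r') as f:
    d = f['3D_matrix']
    print(d.shape, d.chunks, d.compression)  # expect (4*10**7, 5000, 2), (100, 500, 2), 'gzip'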

For the second step, I read the previous dataset and measure the time taken to estimate the mean.

# Classical read
import time
import h5py
import numpy as np

tp = time.time()
filename = 'path/3D_matrix_1.hdf5'
with h5py.File(filename, mode='r') as f:
    # List all keys (actually there is only one dataset)
    a_group_key = list(f.keys())[0]  # the only dataset in the h5 file
    # Get the data
    result = f.get(a_group_key)
    print(result.shape)
    # read 1000 rows at a time
    start_ = 0  # initialize a start counter
    means = []
    while start_ < result.shape[0]:
        arr = np.array(result[start_:start_ + 1000])  # load only this block into memory
        means.append(arr.mean())
        start_ += 1000
    final_mean = np.array(means).mean()
    print(final_mean, len(means))
    tmp = time.time() - tp
    print('Total reading and measuring time without dask: {:.2f}'.format(tmp))
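
One subtlety: averaging the per-block means is only exact here because every block has the same 1000 rows. If the last block could be shorter, the block means would need to be weighted, for example (a minimal sketch with made-up numbers):

# Hypothetical uneven blocks: weight each block mean by its row count
import numpy as np
block_means = np.array([0.50, 0.48, 0.52])  # illustrative per-block means
block_sizes = np.array([1000, 1000, 500])   # rows per block (last one shorter)
print(np.average(block_means, weights=block_sizes))  # exact overall mean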

For the third step, I proceeded as follows:

# Dask way
import time
import h5py
import dask.array as da
from dask import delayed

tp = time.time()
filename = 'path/3D_matrix_1.hdf5'
dset = h5py.File(filename, 'r')
dataset_name = list(dset.keys())[0]  # to obtain the dataset name
result = dset.get(dataset_name)
array = da.from_array(result, chunks=chunks)  # should this be parallelized with delayed?
print('Gigabytes of the input: {}'.format(array.nbytes / 1e9))  # gigabytes of the input, processed lazily
x = delayed(array.mean(axis=0))  # use delayed to parallelize (kind of...)
print('Mean array: {}'.format(x.compute()))
tmp = time.time() - tp
print('Total reading and measuring time with dask: {:.2f}'.format(tmp))
dset.close()
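
If I understand the dask docs correctly, array.mean(axis=0) is already a lazy dask array, so the delayed wrapper above should be unnecessary. This is the variant I would expect to be idiomatic (a sketch, same file and dataset as above):

# Dask without delayed (sketch)
import time
import h5py
import dask.array as da

tp = time.time()
with h5py.File('path/3D_matrix_1.hdf5', 'r') as f:
    array = da.from_array(f['3D_matrix'], chunks=(100, 500, 2))
    x = array.mean(axis=0)  # lazy dask array; nothing is read yet
    print('Mean array: {}'.format(x.compute()))  # reading and parallel reduction happen here
print('Time: {:.2f}'.format(time.time() - tp))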

I think I am missing something, since the dask execution takes longer than the classical method. Besides, I think the chunk option could be the reason for this, since I used the same chunk size for both the h5 dataset and the dask array.
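
For example (an assumption on my part, not something I have benchmarked yet), keeping the dask chunks a multiple of the HDF5 chunks but much larger should cut the number of tasks dask has to schedule:

# Hypothetical: larger dask chunks, aligned with the (100, 500, 2) HDF5 chunks
array = da.from_array(result, chunks=(10000, 5000, 2))  # fewer, bigger tasks
print(array.numblocks)  # blocks per axis that dask will schedule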

Any suggestions on this procedure would be welcome.
