
I'm trying out Dask on an embarrassingly parallel read of 24 scientific data files, each ~250 MB, so ~6 GB total. Each file holds a 2D array. The data is stored on a parallel file system and read from a cluster, though right now I'm reading from a single node only. The files are in a format similar to HDF5 (called Adios) and are read much like with the h5py package. Each file takes about 4 seconds to read. I'm following the skimage-reading example here (http://docs.dask.org/en/latest/array-creation.html). However, I never get a speedup, no matter how many workers I use. I thought perhaps I was using it wrong and was still only running 1 worker, but when I profile it there do appear to be 24 workers. How can I get a speedup when reading this data?

[dask_profile screenshot]

import adios as ad
import numpy as np
import dask.array as da
import dask

# paths: list of the 24 Adios file paths (defined elsewhere)

# Lazily read the full 'data' array from a single Adios file.
bpread = dask.delayed(lambda f: ad.file(f)['data'][...], pure=True)
lazy_datas = [bpread(path) for path in paths]

# Read one file eagerly to learn the dtype and shape.
sample = lazy_datas[0].compute()

# Wrap each delayed read in a dask array and stack along a new axis.
arrays = [da.from_delayed(lazy_data, dtype=sample.dtype, shape=sample.shape)
          for lazy_data in lazy_datas]
datas = da.stack(arrays, axis=0)

# Read all files with the multiprocessing scheduler and 24 workers.
datas2 = datas.compute(scheduler='processes', num_workers=24)
Michael

1 Answer


I recommend looking at the /profile tab of the scheduler's dashboard. It will tell you which lines of code are taking the most time.
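A minimal sketch of getting that dashboard (my assumption of how you'd wire it up, not code from the question): the /profile tab is only available when you run on the distributed scheduler, so start a local cluster and reuse the stacked array `datas` from your snippet. The multiprocessing scheduler you're using has no dashboard.

# A minimal sketch, assuming dask.distributed is installed and that
# 'datas' is the stacked dask array from the question.
from dask.distributed import Client

client = Client(n_workers=24, threads_per_worker=1)
print(client.dashboard_link)   # open this URL in a browser, then the /profile tab
datas2 = datas.compute()       # now runs on the distributed workers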

My first guess is that you are already maxing out your disk's ability to serve data to you: you aren't CPU bound, so adding more cores won't help. That's just a guess, though; as always, you'll have to profile and investigate your situation further to know for sure.
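One rough way to check that guess outside of Dask (a sketch that reuses the adios read and the `paths` list from the question; the worker count is illustrative):

# A standalone check of the disk-bound hypothesis: time one read, then time
# all 24 reads issued concurrently from separate processes. If the concurrent
# wall time is close to 24x the single-read time, the storage (or the reader)
# is serializing the reads and more Dask workers won't help; if it is close to
# the single-read time, the reads do parallelize and the bottleneck is elsewhere.
import time
from concurrent.futures import ProcessPoolExecutor

import adios as ad

def read_one(path):
    return ad.file(path)['data'][...]

t0 = time.time()
read_one(paths[0])
print(f"one file: {time.time() - t0:.1f}s")

t0 = time.time()
with ProcessPoolExecutor(max_workers=24) as pool:
    list(pool.map(read_one, paths))
print(f"24 files in parallel: {time.time() - t0:.1f}s")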

MRocklin
  • Thanks, I'll take a look. I don't think this is maxing out the disk; I wrote a separate mpi4py implementation that scales well with more ranks. – Michael Oct 07 '18 at 03:40