
I'm trying out Dask on an embarrassingly parallel read of 24 scientific data files, each ~250 MB, so ~6 GB total. Each file holds a 2D array. The data is stored on a parallel file system and read from a cluster, though right now I'm reading from a single node only. The files are in a format similar to HDF5 (called Adios) and are read much like with the h5py package. Each file takes about 4 seconds to read. I'm following the skimage-reading example here (http://docs.dask.org/en/latest/array-creation.html). However, I never get a speedup, no matter how many workers I use. I thought perhaps I was using it wrong and was still only running 1 worker, but when I profile it there do appear to be 24 workers. How can I get a speedup when reading this data?

[dask_profile screenshot]

import adios as ad
import numpy as np
import dask.array as da
import dask

# paths: list of the 24 Adios file paths (defined elsewhere)

# Lazily read the full 'data' array from a single Adios file.
bpread = dask.delayed(lambda f: ad.file(f)['data'][...], pure=True)
lazy_datas = [bpread(path) for path in paths]

# Read one file eagerly to learn the dtype and shape.
sample = lazy_datas[0].compute()

# Wrap each delayed read in a dask array and stack along a new axis.
arrays = [da.from_delayed(lazy_data, dtype=sample.dtype, shape=sample.shape)
          for lazy_data in lazy_datas]
datas = da.stack(arrays, axis=0)

# Read all files with the multiprocessing scheduler and 24 workers.
datas2 = datas.compute(scheduler='processes', num_workers=24)
Michael

1 Answer


I recommend looking at the /profile tab of the scheduler's dashboard. It will tell you which lines of code are taking the most time.
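A minimal sketch of getting that dashboard (my assumption of how you'd wire it up, not code from the question): the /profile tab is only available when you run on the distributed scheduler, so start a local cluster and reuse the stacked array `datas` from your snippet. The multiprocessing scheduler you're using has no dashboard.

# A minimal sketch, assuming dask.distributed is installed and that
# 'datas' is the stacked dask array from the question.
from dask.distributed import Client

client = Client(n_workers=24, threads_per_worker=1)
print(client.dashboard_link)   # open this URL in a browser, then the /profile tab
datas2 = datas.compute()       # now runs on the distributed workers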

My first guess is that you are already maxing out your disk's ability to serve data to you: you aren't CPU bound, so adding more cores won't help. That's just a guess, though; as always, you'll have to profile and investigate your situation further to know for sure.
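One rough way to check that guess outside of Dask (a sketch that reuses the adios read and the `paths` list from the question; the worker count is illustrative):

# A standalone check of the disk-bound hypothesis: time one read, then time
# all 24 reads issued concurrently from separate processes. If the concurrent
# wall time is close to 24x the single-read time, the storage (or the reader)
# is serializing the reads and more Dask workers won't help; if it is close to
# the single-read time, the reads do parallelize and the bottleneck is elsewhere.
import time
from concurrent.futures import ProcessPoolExecutor

import adios as ad

def read_one(path):
    return ad.file(path)['data'][...]

t0 = time.time()
read_one(paths[0])
print(f"one file: {time.time() - t0:.1f}s")

t0 = time.time()
with ProcessPoolExecutor(max_workers=24) as pool:
    list(pool.map(read_one, paths))
print(f"24 files in parallel: {time.time() - t0:.1f}s")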

MRocklin
  • Thanks, I'll take a look. I don't think this is maxing out the disk; I wrote a separate mpi4py implementation that scales well with more ranks. – Michael Oct 07 '18 at 03:40