I'm trying to read a lot of CSV files from S3 with Dask workers running on EC2 instances that have the right IAM roles (I can read from the same buckets from other scripts). When I try to read my own data from a private bucket with this command:
from dask.distributed import Client
import dask.dataframe as dd

client = Client('scheduler-on-ec2')
df = dd.read_csv('s3://xyz/*csv.gz',
                 compression='gzip',
                 blocksize=None,  # gzip is not splittable, so one partition per file
                 # storage_options={'key': '', 'secret': ''}
                 )
df.size.compute()
it looks like the data is read locally (by the local Python interpreter, not the workers), then sent to the workers (or the scheduler?) by the local interpreter, and only once the workers receive the chunks do they run the computation and return the results. The behaviour is the same with or without passing the key and the secret via storage_options.
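One way to check whether the workers themselves can reach the private bucket through their IAM role (just a sketch, reusing the placeholder scheduler address and bucket name from above) is to run a small function in every worker process with client.run:

import s3fs
from dask.distributed import Client

client = Client('scheduler-on-ec2')

def check_s3_access():
    # Build a fresh S3 filesystem inside the worker process; with no keys
    # given it should pick up the EC2 instance's IAM role credentials.
    fs = s3fs.S3FileSystem()
    return fs.ls('xyz')[:3]  # 'xyz' is the private bucket from above

# Executes the function in every worker process and returns a dict
# keyed by worker address.
print(client.run(check_s3_access))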
When I read from a public S3 bucket (NYC taxi data) with storage_options={'anon': True}, everything looks okay.
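For comparison, the public read that works looks roughly like this (the exact NYC taxi bucket/key is from memory and may have changed):

import dask.dataframe as dd

# Anonymous access to a public bucket works without any credentials.
taxi = dd.read_csv('s3://nyc-tlc/trip data/yellow_tripdata_2015-01.csv',
                   storage_options={'anon': True})
print(taxi.head())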
What do you think the problem is, and what should I reconfigure to get the workers to read directly from S3?
s3fs is installed correctly, and these are the supported filesystems according to dask:
>>> dask.bytes.core._filesystems
{'file': dask.bytes.local.LocalFileSystem,
's3': dask.bytes.s3.DaskS3FileSystem}
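The same check can be repeated on the workers (again just a sketch, reusing the client from above) to make sure s3fs is importable and the 's3' filesystem is registered there too, not only in the local interpreter:

def worker_filesystems():
    import dask.bytes.core
    return sorted(dask.bytes.core._filesystems)

def worker_s3fs_version():
    import s3fs
    return s3fs.__version__

# Both functions are executed inside every worker process.
print(client.run(worker_filesystems))
print(client.run(worker_s3fs_version))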
Update
After monitoring the network interfaces, it looks like something is uploaded from the interpreter to the scheduler. The more partitions there are in the dataframe (or bag), the more data is sent to the scheduler. I thought it could be the computation graph, but it seems far too big for that: for 12 files it is 2-3 MB, for 30 files it is 20 MB, and for larger data (150 files) it simply takes too long to send to the scheduler, so I didn't wait for it to finish. What else is being sent to the scheduler that could take up this amount of data?
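A rough way to estimate how big the serialized graph itself is (only a sketch; the pickled size is an approximation of what actually travels from the client to the scheduler) would be:

import cloudpickle
import dask.dataframe as dd

df = dd.read_csv('s3://xyz/*csv.gz', compression='gzip', blocksize=None)

graph = dict(df.__dask_graph__())   # the task graph for the collection
pickled = cloudpickle.dumps(graph)  # roughly what has to be serialized
print(len(graph), 'tasks,', len(pickled) / 1e6, 'MB pickled')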