
I am trying to read a set of CSVs from S3 into a Dask DataFrame. The bucket has a deep hierarchy and also contains some metadata files. The call looks like:

dd.read_csv('s3://mybucket/dataset/*/*/*/*/*/*.csv')

This causes Dask to hang. The real problem is that s3fs.glob hangs trying to resolve a glob pattern with that many stars. I tried replacing the glob with an explicit list computed by boto3.list_objects, but that returns at most 1,000 paths per call; I have orders of magnitude more.

How can I efficiently specify this set of files to dask.dataframe.read_csv?

One way to reframe this question: how do I efficiently obtain a complete recursive listing of a large S3 bucket in Python? That ignores the possibility that there is some other pattern-based way of calling dask.dataframe.read_csv.

Daniel Mahler

1 Answer


You can use Paginators in boto3 to list all objects in your bucket. You can also specify a prefix to restrict the search. A sample of such code is given in the documentation; you can simply copy-paste it and replace the bucket name and prefix.

import boto3

# Create an S3 client and a paginator, so the listing is not capped
# at the 1,000-key limit of a single list_objects call.
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')

# Restrict the listing to keys under the given prefix.
operation_parameters = {'Bucket': 'my-bucket',
                        'Prefix': 'foo/baz'}
page_iterator = paginator.paginate(**operation_parameters)

# Each page holds up to 1,000 objects; iterate over all of them.
for page in page_iterator:
    print(page['Contents'])
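
For completeness, here is a minimal sketch of how the paginated listing could be fed back into dask.dataframe.read_csv, assuming the bucket and prefix from the question ('mybucket' and 'dataset/') and that only the .csv keys should be read:

import boto3
import dask.dataframe as dd

client = boto3.client('s3')
paginator = client.get_paginator('list_objects')

# Walk every page of the listing and keep only the .csv keys,
# skipping the metadata files that also live in the bucket.
csv_paths = []
for page in paginator.paginate(Bucket='mybucket', Prefix='dataset/'):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('.csv'):
            csv_paths.append('s3://mybucket/' + key)

# read_csv accepts an explicit list of paths, so s3fs never has to
# expand the multi-level glob pattern.
df = dd.read_csv(csv_paths)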
Ninad Gaikwad
  • This is the right answer to the question as posed; s3fs (used in Dask) is really trying to work directory-by-directory. However, just having so many files and levels is probably a suboptimal pattern. – mdurant Jun 12 '19 at 13:16
  • @mdurant It is suboptimal for processing with dask etc, but that is the data I need to process. – Daniel Mahler Jun 18 '19 at 19:15