I am trying to read a set of CSVs from S3 into a Dask DataFrame. The bucket has a deep hierarchy and also contains some metadata files. The call looks like
dd.read_csv('s3://mybucket/dataset/*/*/*/*/*/*.csv')
This causes Dask to hang. The real problem is that s3fs.glob hangs trying to resolve a glob pattern with that many stars. I tried replacing the glob with an explicit list of paths computed via boto3.list_objects, but that returns at most 1000 keys per call, and I have orders of magnitude more. How can I efficiently specify this set of files to dask.dataframe.read_csv?
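For reference, the explicit-list attempt looked roughly like this (a sketch, with the bucket name and prefix taken from the glob above):

import boto3
import dask.dataframe as dd

s3 = boto3.client('s3')

# a single list_objects_v2 call returns at most 1000 keys, which is
# only a small fraction of the dataset
resp = s3.list_objects_v2(Bucket='mybucket', Prefix='dataset/')
keys = [obj['Key'] for obj in resp.get('Contents', []) if obj['Key'].endswith('.csv')]

df = dd.read_csv(['s3://mybucket/' + key for key in keys])

As far as I can tell, read_csv does accept an explicit list of paths, so the missing piece is really the complete listing.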
One way to reframe this question could be: how do I efficiently obtain a complete recursive listing of a large S3 bucket in Python? That ignores the possibility of there being some other pattern-based way of calling dask.dataframe.read_csv.
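For what it's worth, I assume something like the following paginated listing would get around the 1000-key limit (an untested sketch, same bucket and prefix as above), but I would still like to know whether there is a more direct way to point read_csv at the files:

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# the paginator keeps issuing requests until the whole prefix has been
# walked, yielding up to 1000 keys per page
keys = []
for page in paginator.paginate(Bucket='mybucket', Prefix='dataset/'):
    for obj in page.get('Contents', []):
        if obj['Key'].endswith('.csv'):
            keys.append('s3://mybucket/' + obj['Key'])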