
I can load the data only if I change the "anon" parameter to True after making the file public.

import dask.dataframe as dd
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options={'anon': True})

This is not recommended for obvious reasons. How do I load the data from S3 securely?

Andrew Gaul
shantanuo

3 Answers


The backend which loads the data from S3 is s3fs; it has a section on credentials here, which mostly points you to boto3's documentation.

The short answer is that there are a number of ways to provide S3 credentials, some of which are automatic: a credentials file in the right place, environment variables (which must be accessible to all workers), or the cluster's metadata service.
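
For example, if the standard AWS environment variables are set on every worker, no storage_options are needed at all; a minimal sketch, with the bucket and path as placeholders:

import dask.dataframe as dd

# Assumes AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY (or a shared credentials
# file, or an instance role) are available on every worker; s3fs picks them
# up automatically, so no storage_options are needed.
df = dd.read_csv('s3://mybucket/some-big.csv')
print(df.head())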

Alternatively, you can provide your key/secret directly in the call, but that of course means you must trust your execution platform and the communication between workers:

import dask.dataframe as dd
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options={'key': mykey, 'secret': mysecret})

The set of parameters you can pass in storage_options when using s3fs can be found in the API docs.
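
For illustration, a sketch of a couple of the commonly used s3fs options (all values are placeholders; check those API docs for the exact set of accepted keys):

import dask.dataframe as dd

# Sketch only: 'token' carries a temporary session token alongside key/secret,
# and 'client_kwargs' is passed through to the underlying botocore client,
# e.g. to pin a region. All values are placeholders.
storage_options = {
    'key': mykey,
    'secret': mysecret,
    'token': mytoken,  # e.g. from STS temporary credentials
    'client_kwargs': {'region_name': 'us-east-1'},
}
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options=storage_options)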

General reference: http://docs.dask.org/en/latest/remote-data-services.html

mdurant

If you're running within your virtual private cloud (VPC), access to S3 will likely already be credentialed (for example via an instance role) and you can read the file in without a key:

import dask.dataframe as dd
df = dd.read_csv('s3://<bucket>/<path to file>.csv')

If you aren't credentialed, you can use the storage_options parameter and pass a key pair (key and secret):

import dask.dataframe as dd
storage_options = {'key': <s3 key>, 'secret': <s3 secret>}
df = dd.read_csv('s3://<bucket>/<path to file>.csv', storage_options=storage_options)

Full documentation from Dask can be found here.

Benjamin Cohen

Under the hood, Dask uses boto3-style credential resolution (via s3fs), so you can set up your keys in pretty much all the ways boto3 supports, e.g. role-based via export AWS_PROFILE=xxxx, or by explicitly exporting your access key and secret as environment variables. I would advise against hard-coding your keys, lest you accidentally expose your code to the public.

$ export AWS_PROFILE=your_aws_cli_profile_name

or

https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
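
A minimal sketch of the profile route from within Python, assuming the named profile already exists in your AWS CLI configuration and the bucket path is a placeholder:

import os
import dask.dataframe as dd

# Select an existing AWS CLI profile; boto3/s3fs resolves the credentials
# for this profile from ~/.aws/credentials.
os.environ['AWS_PROFILE'] = 'your_aws_cli_profile_name'

df = dd.read_csv('s3://<bucket_name>/<path to file>.csv')
print(df.head())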

For S3 you can use a wildcard match to fetch multiple chunked files:

import dask.dataframe as dd

# Given N CSV files stored in S3, read them all and compute the total record count
s3_url = 's3://<bucket_name>/dask-tutorial/data/accounts.*.csv'
df = dd.read_csv(s3_url)

print(df.head())
print(len(df))  # triggers a full pass over the data

Timothy Mugayi
    Is there any limit on the maximum number of S3 files that Dask can read at once, in the case of Parquet files? – Dcook Apr 06 '21 at 04:23