
I can load the data only if I change the "anon" parameter to True after making the file public.

import dask.dataframe as dd
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options={'anon': True})

This is not recommended for obvious reasons. How do I load the data from S3 securely?

Andrew Gaul
shantanuo

3 Answers


The backend which loads the data from S3 is s3fs; it has a section on credentials here, which mostly points you to boto3's documentation.

The short answer is that there are a number of ways to provide S3 credentials, some of which are automatic: a credentials file in the right place, environment variables (which must be accessible to all workers), or the cluster's metadata service.
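
For example, if the standard AWS environment variables are set on every worker, no storage_options are needed at all; a minimal sketch, with the bucket and path as placeholders:

import dask.dataframe as dd

# Assumes AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY (or a shared credentials
# file, or an instance role) are available on every worker; s3fs picks them
# up automatically, so no storage_options are needed.
df = dd.read_csv('s3://mybucket/some-big.csv')
print(df.head())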

Alternatively, you can provide your key/secret directly in the call, but that of course means you must trust your execution platform and the communication between workers:

import dask.dataframe as dd
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options={'key': mykey, 'secret': mysecret})

The set of parameters you can pass in storage_options when using s3fs can be found in the API docs.
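
For illustration, a sketch of a couple of the commonly used s3fs options (all values are placeholders; check those API docs for the exact set of accepted keys):

import dask.dataframe as dd

# Sketch only: 'token' carries a temporary session token alongside key/secret,
# and 'client_kwargs' is passed through to the underlying botocore client,
# e.g. to pin a region. All values are placeholders.
storage_options = {
    'key': mykey,
    'secret': mysecret,
    'token': mytoken,  # e.g. from STS temporary credentials
    'client_kwargs': {'region_name': 'us-east-1'},
}
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options=storage_options)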

General reference: http://docs.dask.org/en/latest/remote-data-services.html

mdurant

If you're running within your virtual private cloud (VPC), access to S3 will likely already be credentialed (for example via an instance role) and you can read the file in without a key:

import dask.dataframe as dd
df = dd.read_csv('s3://<bucket>/<path to file>.csv')

If you aren't credentialed, you can use the storage_options parameter and pass a key pair (key and secret):

import dask.dataframe as dd
storage_options = {'key': <s3 key>, 'secret': <s3 secret>}
df = dd.read_csv('s3://<bucket>/<path to file>.csv', storage_options=storage_options)

Full documentation from Dask can be found here.

Benjamin Cohen

Under the hood, Dask uses boto3-style credential resolution (via s3fs), so you can set up your keys in pretty much all the ways boto3 supports, e.g. role-based via export AWS_PROFILE=xxxx, or by explicitly exporting your access key and secret as environment variables. I would advise against hard-coding your keys, lest you accidentally expose your code to the public.

$ export AWS_PROFILE=your_aws_cli_profile_name

or

https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
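
A minimal sketch of the profile route from within Python, assuming the named profile already exists in your AWS CLI configuration and the bucket path is a placeholder:

import os
import dask.dataframe as dd

# Select an existing AWS CLI profile; boto3/s3fs resolves the credentials
# for this profile from ~/.aws/credentials.
os.environ['AWS_PROFILE'] = 'your_aws_cli_profile_name'

df = dd.read_csv('s3://<bucket_name>/<path to file>.csv')
print(df.head())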

For S3 you can use a wildcard match to fetch multiple chunked files:

import dask.dataframe as dd

# Given N CSV files stored in S3, read them all and compute the total record count
s3_url = 's3://<bucket_name>/dask-tutorial/data/accounts.*.csv'
df = dd.read_csv(s3_url)

print(df.head())
print(len(df))  # triggers a full pass over the data

Timothy Mugayi
    Is there any limit on the maximum number of S3 files that Dask can read at once, in the case of Parquet files? – Dcook Apr 06 '21 at 04:23