I'm trying to read a lot of CSV files from S3 with Dask workers running on EC2 instances that have the right IAM roles (I can read from the same buckets from other scripts). When I try to read my own data from a private bucket with this command:
from dask.distributed import Client
import dask.dataframe as dd

client = Client('scheduler-on-ec2')
df = dd.read_csv('s3://xyz/*csv.gz',
                 compression='gzip',
                 blocksize=None,  # gzip is not splittable, so one partition per file
                 # storage_options={'key': '', 'secret': ''}
                 )
df.size.compute()
it looks like the data is read locally (by the local Python interpreter, not the workers), then sent to the workers (or the scheduler?) by the local interpreter, and only once the workers receive the chunks do they run the computation and return the results. The behaviour is the same with or without passing the key and the secret via storage_options.
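One way to check whether the workers themselves can reach the private bucket through their IAM role (just a sketch, reusing the placeholder scheduler address and bucket name from above) is to run a small function in every worker process with client.run:

import s3fs
from dask.distributed import Client

client = Client('scheduler-on-ec2')

def check_s3_access():
    # Build a fresh S3 filesystem inside the worker process; with no keys
    # given it should pick up the EC2 instance's IAM role credentials.
    fs = s3fs.S3FileSystem()
    return fs.ls('xyz')[:3]  # 'xyz' is the private bucket from above

# Executes the function in every worker process and returns a dict
# keyed by worker address.
print(client.run(check_s3_access))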
When I read from a public S3 bucket (NYC taxi data) with storage_options={'anon': True}, everything looks okay.
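For comparison, the public read that works looks roughly like this (the exact NYC taxi bucket/key is from memory and may have changed):

import dask.dataframe as dd

# Anonymous access to a public bucket works without any credentials.
taxi = dd.read_csv('s3://nyc-tlc/trip data/yellow_tripdata_2015-01.csv',
                   storage_options={'anon': True})
print(taxi.head())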
What do you think the problem is, and what should I reconfigure to get the workers to read directly from S3?
s3fs is installed correctly, and these are the supported filesystems according to dask:
>>> dask.bytes.core._filesystems
{'file': dask.bytes.local.LocalFileSystem,
's3': dask.bytes.s3.DaskS3FileSystem}
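The same check can be repeated on the workers (again just a sketch, reusing the client from above) to make sure s3fs is importable and the 's3' filesystem is registered there too, not only in the local interpreter:

def worker_filesystems():
    import dask.bytes.core
    return sorted(dask.bytes.core._filesystems)

def worker_s3fs_version():
    import s3fs
    return s3fs.__version__

# Both functions are executed inside every worker process.
print(client.run(worker_filesystems))
print(client.run(worker_s3fs_version))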
Update
After monitoring the network interfaces, it looks like something is uploaded from the interpreter to the scheduler. The more partitions there are in the dataframe (or bag), the more data is sent to the scheduler. I thought it could be the computation graph, but it seems far too big for that: for 12 files it is 2-3 MB, for 30 files it is 20 MB, and for larger data (150 files) it simply takes too long to send to the scheduler, so I didn't wait for it to finish. What else is being sent to the scheduler that could take up this amount of data?
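A rough way to estimate how big the serialized graph itself is (only a sketch; the pickled size is an approximation of what actually travels from the client to the scheduler) would be:

import cloudpickle
import dask.dataframe as dd

df = dd.read_csv('s3://xyz/*csv.gz', compression='gzip', blocksize=None)

graph = dict(df.__dask_graph__())   # the task graph for the collection
pickled = cloudpickle.dumps(graph)  # roughly what has to be serialized
print(len(graph), 'tasks,', len(pickled) / 1e6, 'MB pickled')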