
I'm trying to use Dask to get multiple JSON files from AWS S3 into memory in a SageMaker Jupyter notebook. When I submit 10 or 20 workers, everything runs smoothly. However, when I submit 100 workers, between 30% and 50% of them fail with the following error: 'Unable to locate credentials'

Initially I was using Boto3. To try to eliminate the issue, I switched to s3fs, but the same error occurs.
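For reference, this is roughly what the earlier Boto3 version looked like (a minimal sketch; the function name is mine, and S3_BUCKET_NAME / HTTP_SUCCESSFUL_REQUEST_CODE are the constants from the listing below):

import boto3
import json

def _get_object_from_bucket_boto3(key):
    # Each task builds its own client rather than sharing one across processes
    s3_client = boto3.client('s3')
    response = s3_client.get_object(Bucket=S3_BUCKET_NAME, Key=key)
    if response['ResponseMetadata']['HTTPStatusCode'] != HTTP_SUCCESSFUL_REQUEST_CODE:
        raise RuntimeError(f"Unexpected status code for key {key}")
    return json.loads(response['Body'].read())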

Which workers fail with the NoCredentialsError is random if I repeat the experiment, as is the exact number of failed downloads.

SageMaker handles all the AWS credentials through its IAM role, so I have no access to key pairs or anything. The ~/.aws/config file contains only the default region - nothing about credentials.
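A sanity check along these lines (a sketch; client.run executes a function once on every worker and returns a dict keyed by worker address) can show whether each worker's boto3 session resolves credentials:

import boto3
from dask.distributed import Client, LocalCluster

def _has_credentials():
    # True if this worker's default boto3 session can resolve credentials
    return boto3.Session().get_credentials() is not None

cluster = LocalCluster(n_workers=4, processes=True)
client = Client(cluster)
print(client.run(_has_credentials))  # {worker address: True/False}
client.close()
cluster.close()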

This seems like a very common use case for Dask, so it is clearly capable of such a task - where am I going wrong?
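For what it's worth, the idiomatic Dask route I would expect to work is dask.bag, reusing _get_object_from_bucket from the listing below (a sketch, assuming the default scheduler or an already-registered client):

import dask.bag as db

# Distribute the per-key fetch across partitions and gather the results
bag = db.from_sequence(keys_100, npartitions=CPU_COUNT).map(_get_object_from_bucket)
data = bag.compute()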

Any help would be much appreciated! Code and output below. In this example, 29 workers failed due to credentials. Thanks, Patrick

import boto3
import json
import logging
import multiprocessing
from dask.distributed import Client, LocalCluster
import s3fs
import os

THREADS_PER_DASK_WORKER = 4
CPU_COUNT = multiprocessing.cpu_count()
HTTP_SUCCESSFUL_REQUEST_CODE = 200

S3_BUCKET_NAME = '-redacted-'

keys_100 = ['-redacted-']
keys_10 = ['-redacted-']

def dispatch_workers(workers):

    # One Dask worker process per callable, capped at the machine's CPU count
    cluster_workers = min(len(workers), CPU_COUNT)
    cluster = LocalCluster(n_workers=cluster_workers, processes=True,
                           threads_per_worker=THREADS_PER_DASK_WORKER)
    client = Client(cluster)

    data = []
    data_futures = []

    for worker in workers:
        data_futures.append(client.submit(worker))

    for future in data_futures:
        try:
            tmp_flight_data = future.result()
            if future.status == 'finished':
                data.append(tmp_flight_data)
            else:
                logging.error(f"Future status = {future.status}")
        except Exception as err:
            logging.error(err)

    del data_futures

    cluster.close()
    client.close()

    return data

def _get_object_from_bucket(key):

    # Each task creates its own S3FileSystem; anon=False uses the default credential chain
    s3 = s3fs.S3FileSystem(anon=False)
    with s3.open(os.path.join(S3_BUCKET_NAME, key)) as f:
        return json.loads(f.read())

def get_data(keys):

    objects = dispatch_workers(
        [lambda key=key: _get_object_from_bucket(key) for key in keys]
    )
    return objects
    
data = get_data(keys_100)

Output:

ERROR:root:Unable to locate credentials
(the line above is repeated 29 times in total, once per failed worker)