8

I'm trying to send data from the workers of a PySpark RDD to an SQS queue, using boto3 to talk with AWS. I need to send data directly from the partitions, rather than collecting the RDD and sending data from the driver.

I am able to send messages to SQS via boto3 locally and from the Spark driver; I can also import boto3 and create a boto3 session on the partitions. However, when I try to create a client or resource from the partitions, I receive an error. I believe boto3 is not correctly creating a client, but I'm not entirely sure on that point. My code looks like this:

def get_client(x):   # x is the partition iterator that pyspark's mapPartitions passes in
    import boto3
    client = boto3.client('sqs', region_name="us-east-1", aws_access_key_id="myaccesskey", aws_secret_access_key="mysecretaccesskey")
    return x

rdd_with_client = rdd.mapPartitions(get_client)

The error:

DataNotFoundError: Unable to load data for: endpoints

The longer traceback:

File "<stdin>", line 4, in get_client
  File "./rebuilt.zip/boto3/session.py", line 250, in client
    aws_session_token=aws_session_token, config=config)
  File "./rebuilt.zip/botocore/session.py", line 810, in create_client
    endpoint_resolver = self.get_component('endpoint_resolver')
  File "./rebuilt.zip/botocore/session.py", line 691, in get_component
    return self._components.get_component(name)
  File "./rebuilt.zip/botocore/session.py", line 872, in get_component
    self._components[name] = factory()
  File "./rebuilt.zip/botocore/session.py", line 184, in create_default_resolver
    endpoints = loader.load_data('endpoints')
  File "./rebuilt.zip/botocore/loaders.py", line 123, in _wrapper
    data = func(self, *args, **kwargs)
  File "./rebuilt.zip/botocore/loaders.py", line 382, in load_data
    raise DataNotFoundError(data_path=name)
DataNotFoundError: Unable to load data for: endpoints

I've also tried modifying my function to create a resource instead of the explicit client, to see if it could find & use the default client setup. In that case, my code is:

def get_resource(x):
    import boto3
    sqs = boto3.resource('sqs', region_name="us-east-1", aws_access_key_id="myaccesskey", aws_secret_access_key="mysecretaccesskey")
    return x

rdd_with_client = rdd.mapPartitions(get_resource)

I receive an error pointing to a has_low_level_client parameter, which is triggered because the client doesn't exist; the traceback says:

File "/usr/lib/spark/python/pyspark/rdd.py", line 2253, in pipeline_func
  File "/usr/lib/spark/python/pyspark/rdd.py", line 270, in func
  File "/usr/lib/spark/python/pyspark/rdd.py", line 689, in func
  File "<stdin>", line 4, in session_resource
  File "./rebuilt.zip/boto3/session.py", line 329, in resource
    has_low_level_client)
ResourceNotExistsError: The 'sqs' resource does not exist.
The available resources are:
   -

No resources available because, I think, there's no client to house them.

I've been banging my head against this one for a few days now. Any help appreciated!

EmmaOnThursday
  • Please find where botocore is installed and check the `data` subdirectory. You should also make sure that you have the ability to read from disk. – Jordon Phillips Jun 21 '16 at 18:18
  • Hi Jordon, what am I looking for in the data subdirectory? I have a file called endpoints.json there, but that's all that looks related to this traceback. – EmmaOnThursday Jun 21 '16 at 18:41
  • For whatever reason, botocore is not able to access that `endpoints.json` file, and `boto3` is likewise not able to access the data in its directories. My thought was that it was either not there at all, or that your environment prevents it from being accessed. – Jordon Phillips Jun 21 '16 at 18:44

1 Answer

13

This is because you have bundled boto3 inside a zip file.

"./rebuilt.zip/boto3"

At initialisation, botocore loads a set of JSON data files (service definitions such as `endpoints.json`) from the `data` directory inside its distribution folder. Because your boto3/botocore lives inside a zip package, those files cannot be read from disk, which is exactly what the `DataNotFoundError` is telling you.
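As a quick diagnostic sketch (the `check_botocore_data` helper here is just illustrative, not part of the fix), you can resolve botocore's data directory on a worker and see whether it is readable as a real directory. When botocore is imported from the zip, the path points inside the archive:

def check_botocore_data(x):   # run via mapPartitions, like the functions above
    import os
    import botocore
    # botocore keeps its JSON service definitions (including endpoints.json)
    # under <botocore package dir>/data and loads them from disk at runtime.
    data_dir = os.path.join(os.path.dirname(botocore.__file__), 'data')
    # Returns e.g. ('./rebuilt.zip/botocore/data', False) when botocore lives inside a zip
    return [(data_dir, os.path.isdir(data_dir))]

rdd.mapPartitions(check_botocore_data).take(1)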

The solution is: rather than distributing boto3 inside a zip, install boto3 in your Spark environment itself. Be careful here: depending on how you implement your app, you may need boto3 on both the master node and the worker nodes. The safe bet is to install it on both.

If you are using EMR, you can use a bootstrap action to do it; see the EMR documentation on bootstrap actions for details.

If you're using AWS Glue 2.0, you can use the `--additional-python-modules` job parameter to include boto3; see the Glue documentation for details.

If you're using GCP Dataproc, you can achieve the same by specifying cluster properties; see the Dataproc documentation for details.
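Once boto3 is installed on the nodes, the partition function from the question can create its client and send to SQS directly. A minimal sketch, assuming the queue URL is a placeholder and credentials come from the instance role or environment rather than hard-coded keys:

def send_partition(records):
    import boto3
    # Build the client inside the partition function: boto3 clients are not
    # picklable, so they cannot be created on the driver and shipped out.
    client = boto3.client('sqs', region_name="us-east-1")
    for record in records:
        client.send_message(
            QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",  # placeholder
            MessageBody=str(record),
        )

rdd.foreachPartition(send_partition)   # foreachPartition is an action, so the sends run right away

foreachPartition is used here instead of mapPartitions because nothing needs to come back from the workers; mapPartitions works too, as long as an action is called on the resulting RDD.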

Tom Tang
  • Alternatively, you can bundle those data files in your "./rebuilt.zip/boto3" – Tom Tang Feb 08 '17 at 01:03
  • Well, I'm facing the same problem here; this really sucks. I'd fall back on boto anyway, since it's installed on EMR clusters by default. – avocado Sep 11 '17 at 02:14
  • Or, you can always use a bootstrap action to run "pip install boto3". No need to fall back to boto, as it is going to be sunset. – Tom Tang Jan 01 '18 at 23:43
  • @LiyingTang how would I locate the files it downloads during initialization so I can store it in the zip file? – user422930 Jan 19 '18 at 20:18
  • It may work, although it sounds a bit "hacky". Those JSON files may be downloaded after initialization, so what you can do is initialize the AWS service you use with boto3 locally, then pack exactly that boto3 copy into your package. – Tom Tang Jan 28 '18 at 02:48
  • But the way I prefer is what I suggest in my answer: install boto3 into the system Python on your cluster, and don't pack boto3 into the zip. When importing boto3, it will use the one installed in the system Python, and those data files will be handled properly. – Tom Tang Jan 28 '18 at 02:50
  • Removing boto3 and botocore from the zip file and including the param "--additional-python-modules boto3" solved the issue. Thanks. – sgalinma Sep 28 '22 at 09:52
  • Thanks @sgalinma, I will update the answer. – Tom Tang Sep 29 '22 at 07:51