
I am using IBM Object Storage (comparable to AWS S3) to store data. IBM's object storage implements the S3 API, and Spark's Hadoop configuration can be modified to allow it to connect to Amazon S3.

I am attempting (in pyspark) to access my data by setting the endpoints to point to IBM, as opposed to Amazon.

sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3-api.us-geo.objectstorage.service.networklayer.com")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<my_key>")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret_key>")
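
(For reference, the data is then read through an s3a:// URI along these lines; the bucket and path below are placeholders, not my actual paths:)

# hypothetical read that exercises the fs.s3a.* settings above
rdd = sc.textFile("s3a://my-bucket/path/to/data")
rdd.getNumPartitions()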

This throws the following error:

An error occurred while calling o131.partitions.
: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 7F46A12CDBB841AA)

Note the "Service: Amazon S3;", which makes me assume that the SDK is still pointed towards AWS. Can this be changed?

David Ott

1 Answer


The endpoint you've specified is the 'private' endpoint, which is for workloads sending requests from within the IBM Cloud/Softlayer data center network.

If you are trying to connect to the object store over the public internet, you need to use a different endpoint: s3-api.us-geo.objectstorage.softlayer.net. More information can be found in the (admittedly in-progress) documentation for the open trial.
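
For example, a minimal sketch (the bucket name and path are placeholders, and this assumes the s3a connector jars are already on Spark's classpath):

# public (internet-facing) endpoint instead of the private one
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3-api.us-geo.objectstorage.softlayer.net")
# same credentials as before (placeholders)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<my_key>")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret_key>")

# read through the s3a:// scheme so the fs.s3a.* settings are actually used
rdd = sc.textFile("s3a://my-bucket/path/to/data")
print(rdd.count())

Note that the fs.s3a.* properties only take effect for paths using the s3a:// scheme; the fs.s3.impl setting in your snippet configures the older s3:// scheme and is not used by the s3a connector.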

Please let me know if that doesn't solve the issue - if it's a compatibility defect I'd like to make sure it gets resolved.

Nick Lange
  • Sadly, the error persists regardless of which endpoint I use. I am currently trying to access Object Storage from a Softlayer virtual machine. – David Ott Nov 11 '16 at 20:56
  • I'll check with our testing team. If you'd like to discuss it in detail, feel free to email me at nicholas.lange [at] ibm.com. – Nick Lange Nov 16 '16 at 19:35
  • Nick, if you do have an implementation of the S3 protocols and want to interop with Hadoop, then I would suggest you check out Hadoop branch-2.8 and run its hadoop-aws integration test suites against your endpoint, ideally before the Hadoop 2.8 RC goes out. Some other object stores do test against it, which is why we know to keep the multiple-delete (and soon v2 list) calls optional, but it is always interesting to see what happens against other endpoints. See https://issues.apache.org/jira/browse/HADOOP-11694 for what's coming. – stevel Nov 29 '16 at 17:43
  • Thanks Steve! Appreciate the info. – Nick Lange Nov 30 '16 at 20:52