
I've read a lot of topics on the Internet about how to get Spark working with S3, but nothing works properly. I've downloaded Spark 2.3.2, pre-built for Hadoop 2.7 and later.

I've copied only these libraries from Hadoop 2.7.7 (which matches the Spark/Hadoop version) into the Spark jars folder:

  • hadoop-aws-2.7.7.jar
  • hadoop-auth-2.7.7.jar
  • aws-java-sdk-1.7.4.jar

Still, I can't use either S3N or S3A to get Spark to read my file.

For S3A, I get this exception:

sc.hadoopConfiguration.set("fs.s3a.access.key","myaccesskey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecretkey")
val file = sc.textFile("s3a://my.domain:8080/test_bucket/test_file.txt")
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: AE203E7293ZZA3ED, AWS Error Code: null, AWS Error Message: Forbidden

Using this piece of Python, plus some more code, I can list my buckets, list my files, download files, read files from my computer, and get a file URL. The code gives me the following file URL:

https://my.domain:8080/test_bucket/test_file.txt?Signature=%2Fg3jv96Hdmq2450VTrl4M%2Be%2FI%3D&Expires=1539595614&AWSAccessKeyId=myaccesskey

What should I install / set up / download so that Spark is able to read and write from my S3 server?

Edit 3:

Using the debug tool mentioned in the comments, here's the result.
It seems the issue is something to do with the signature, but I'm not sure what that means.

Kiwy
  • see https://stackoverflow.com/questions/44411493/java-lang-noclassdeffounderror-org-apache-hadoop-fs-storagestatistics – stevel Oct 11 '18 at 14:49
  • Possible duplicate of [java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics](https://stackoverflow.com/questions/44411493/java-lang-noclassdeffounderror-org-apache-hadoop-fs-storagestatistics) – stevel Oct 11 '18 at 14:50
  • If you get permission denied, then your classpath is correct... It's AWS (or minio) that's denying your keys... You can check the minio GitHub issues about Spark support – OneCricketeer Oct 12 '18 at 04:20
  • forget about S3N, it's no longer maintained and underperforms. Focus on S3A and deal with the classpath. To debug, download the full Hadoop distribution, place your S3A key secrets into core-site.xml and then run the diagnostics entry point in https://github.com/steveloughran/cloudstore ; it's the self-diagnostics code I point everyone at – stevel Oct 12 '18 at 09:04
  • that's a way of saying "your classpath is still broken"; that's a file in the hadoop-aws JAR. For Hadoop 3+, you can edit `~/.hadooprc` to pull it in: `hadoop_add_to_classpath_tools hadoop-aws` – stevel Oct 16 '18 at 12:09
  • @SteveLoughran I figured this out and managed to run your tool. I've put the output in the question. Thank you so much for your help. – Kiwy Oct 16 '18 at 13:29
  • oh, this is a non-AWS S3 service. Try setting `fs.s3a.signing-algorithm` to `S3SignerType` – stevel Oct 16 '18 at 17:32
  • @SteveLoughran You could post this as an answer, as this is working. I couldn't express enough gratitude. Thank you very much for your patience, also on the Apache Jira. I wanted to change this parameter but I couldn't find the value to use in the documentation. Thank you thank you thank you. – Kiwy Oct 17 '18 at 06:27
  • It still seems that Spark does not handle path-style access properly and fails to read anything. – Kiwy Oct 17 '18 at 07:40
  • "You could post this as an answer as this is working." No, as the whole post is a duplicate of another issue. – stevel Oct 17 '18 at 09:32
  • "t seems still that Spark do not consider properly path style and fail to read anything.". Should do; it's just using the S3A connector underneath. Afraid you are into debug time: tunr org.apache.hadoop.fs.s3a log level to DEBUG – stevel Oct 17 '18 at 09:33
  • @SteveLoughran not a duplicate, I insist: I can connect to this S3 server using Hadoop 2.9.1, but it's impossible to reproduce with Hadoop 2.7.7 and Hadoop 2.8.5, where I always end up with missing classes, with the same classpath; and cloudstore even gives a different result than a simple bin/hadoop fs -ls s3a://test/. I'm seriously considering the possibility that some builds are very broken regarding dependencies, but I'm probably wrong. Still, I can't understand how something that looks as easy as connecting to a web API to list a folder can be such a hassle – Kiwy Oct 17 '18 at 09:37
  • Still thank you a lot for your support – Kiwy Oct 17 '18 at 09:40
  • I was the release manager for Hadoop 2.7.7; let me reassure you I ran all the hadoop-aws tests for that against AWS S3. Regarding why things "as easy as connecting a web API to list a folder" are hard, can I point you at the list of changes between 2.7.x and 2.8.x for the S3A connector alone, and consider that similar JIRAs cover 2.9, 3.0, 3.1 & 3.2. Mixing JARs doesn't work. https://issues.apache.org/jira/browse/HADOOP-11694 – stevel Oct 17 '18 at 10:18
  • Can I add: if you want to add a section to the S3A docs about working with non-AWS endpoints, add a JIRA under https://issues.apache.org/jira/browse/HADOOP-15620 with a patch to the hadoop-aws markdown files. Thanks – stevel Oct 17 '18 at 10:24
  • @SteveLoughran Once I'm done with this setup I'll try to write some documentation and could add it to the project. If you could take a last look: I've basically rewritten my whole question, and I think there might be a problem either with my classpath, with Hadoop, or with my server, but something is not right. I'm more sure of that now, as it's now been 7 days I've been working on this issue. – Kiwy Oct 17 '18 at 13:07
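
A minimal sketch of the DEBUG suggestion above, assuming Spark 2.x's standard log4j 1.x configuration, is to add one line to conf/log4j.properties:

# conf/log4j.properties: enable DEBUG logging for the S3A connector
log4j.logger.org.apache.hadoop.fs.s3a=DEBUG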

1 Answer


First you will need to download the hadoop-aws.jar and aws-java-sdk.jar that match the Hadoop release of your Spark/Hadoop install, and add them to the jars folder inside the Spark folder.
Then you will need to specify the server you are going to use and enable path-style access if your S3 server does not support virtual-hosted-style (dynamic DNS) buckets:

sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
#I had to change signature version because I have an old S3 api implementation:
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")

Here's my final code:

sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val tmp = sc.textFile("s3a://test_bucket/test_file.txt")
sc.hadoopConfiguration.set("fs.s3a.access.key","mykey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecret")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","true")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
tmp.count()

I would recommend putting most of these settings inside spark-defaults.conf:

spark.hadoop.fs.s3a.impl                   org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access      true
spark.hadoop.fs.s3a.endpoint               my.domain:8080
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.signing-algorithm      S3SignerType
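
With those settings in spark-defaults.conf, only the credentials need to be set at runtime. A minimal sketch, with placeholder key, bucket and file names:

sc.hadoopConfiguration.set("fs.s3a.access.key","mykey")     // placeholder access key
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecret")  // placeholder secret key
val tmp = sc.textFile("s3a://test_bucket/test_file.txt")
tmp.count()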

One of the issues I had was setting spark.hadoop.fs.s3a.connection.timeout to 10: prior to Hadoop 3 this value is interpreted in milliseconds, and it gave me a very long timeout; the error message would only appear 1.5 minutes after the attempt to read a file.
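
For example, assuming the millisecond interpretation described above, a 20-second timeout in spark-defaults.conf would look like this:

# 20000 ms = 20 seconds on Hadoop 2.x
spark.hadoop.fs.s3a.connection.timeout     20000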

PS:
Special thanks to Steve Loughran.
Thank you very much for the precious help.

Kiwy
  • great writeup. You don't need to set spark.hadoop.fs.s3a.impl, BTW; it is worked out automatically from the options in core-default.xml in hadoop-common. If the connection timeout value has changed, that wasn't intentional... – stevel Oct 24 '18 at 16:56
  • @Kiwy correct me, but I think the version for aws-java-sdk.jar is 1.11.x and Hadoop is 2.7.3, so I am not sure what you mean by a version that matches the Hadoop of the Spark/Hadoop install. I have found the most downloads for aws-java-sdk.jar v1.11.656. – a13e Apr 13 '20 at 23:31
  • @AniruddhaTekade What I mean is that you should check which installation of Spark and Hadoop you have; once you've found your Hadoop version, the easiest way to avoid mistakes is to download the corresponding Hadoop release, take the named jars from it, and put them in the spark/jars folder – Kiwy Apr 14 '20 at 14:35