
Any ideas how to read AWS S3 with Scala? I tried this link:

https://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_s3.html

But I could not get it to work. I can do the same in Databricks, but with DSX it's not working.

IBM has documented steps for Python here, but none for Scala: https://datascience.ibm.com/blog/use-ibm-data-science-experience-to-read-and-write-data-stored-on-amazon-s3/

spark.sparkContext.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xyz")
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "abc")

val df_data_1 = spark.read.format("csv").option("header", "true").load("s3a://defg/retail-data/by-day/*.csv")
df_data_1.take(5)

Vik M
  • Do you want to use Spark to read from S3? Update your title and tags if so. Also post the code you wrote which is not working – prayagupa Sep 25 '17 at 16:38

1 Answer


Not sure if there is any difference between using the native filesystem (s3n) and s3a, but s3a works fine:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.
    builder().
    getOrCreate()


// Pass the AWS credentials to the Hadoop S3A connector
val hconf = spark.sparkContext.hadoopConfiguration
hconf.set("fs.s3a.access.key", "XXXXXXXXX")
hconf.set("fs.s3a.secret.key", "XXXXXXXXX")


// format("csv") is an equivalent shorthand for this class name
val dfData1 = spark.
    read.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").
    option("header", "true").
    option("inferSchema", "true").
    load("s3a://charlesbuckets31/users.csv")
dfData1.show(5)

Output

Thanks, Charles.

charles gomes
  • The difference between s3n and s3a is significant, as in "s3a is and will be maintained"; s3n is its predecessor. BTW, schema inference means one scan of the data just to work out the schema, and another to read it. Best to declare the schema in your code – stevel Sep 27 '17 at 15:37
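
As a sketch of the explicit-schema approach the comment above recommends (the column names here are assumptions, since the actual layout of users.csv isn't shown in the answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Hypothetical columns for users.csv -- adjust to the real file's layout.
val userSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

val spark = SparkSession.builder().getOrCreate()

// With schema() instead of inferSchema, Spark skips the extra
// inference pass over the S3 data and reads the file only once.
val dfUsers = spark.
    read.format("csv").
    option("header", "true").
    schema(userSchema).
    load("s3a://charlesbuckets31/users.csv")
```

Passing the schema up front also makes type mismatches surface as parse errors rather than silently inferred string columns.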