
Being fairly new to Spark, while working with Spark Structured Streaming (v2.4.3) I am trying to write my streaming DataFrame to a custom S3-compatible object store. I have made sure that I am able to log in and upload data to the S3 buckets manually through the UI, and I have also set up the ACCESS_KEY and SECRET_KEY for it.

import org.apache.spark.sql.streaming.Trigger

val sc = spark.sparkContext
// Point the s3a connector at the custom object store
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-region1.myObjectStore.com:443")
sc.hadoopConfiguration.set("fs.s3a.access.key", "00cce9eb2c589b1b1b5b")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "flmheKX9Gb1tTlImO6xR++9kvnUByfRKZfI7LJT8")
sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true") // bucket name appended as url/bucket and not bucket.url

val writeToS3Query = stream.writeStream
      .format("csv")
      .option("sep", ",")
      .option("header", true)
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .option("path", "s3a://bucket0/")
      .option("checkpointLocation", "/Users/home/checkpoints/s3-checkpointing")
      .start()

However, I am getting the following error:

Unable to execute HTTP request: bucket0.s3-region1.myObjectStore.com: nodename nor servname provided, or not known

I have a mapping of the URL to its IP in my /etc/hosts file, and the bucket is accessible from other sources. Is there any other way to do this successfully? I am really not sure why the bucket name is being prepended to the endpoint URL when the query is executed by Spark.

Can this be because I am setting the Hadoop configuration on the Spark context after the session is created, so the settings are not taking effect? But then how is it able to resolve the actual URL when the path I am providing is s3a://bucket0?
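For reference, here is a minimal sketch of what I mean, reusing the placeholder endpoint and credentials from above (the app name is arbitrary). Hadoop options can be passed with the "spark.hadoop." prefix when the session is built, so they are in place before any filesystem is instantiated:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-streaming") // hypothetical name
  .config("spark.hadoop.fs.s3a.endpoint", "s3-region1.myObjectStore.com")
  .config("spark.hadoop.fs.s3a.access.key", "00cce9eb2c589b1b1b5b")
  .config("spark.hadoop.fs.s3a.secret.key", "flmheKX9Gb1tTlImO6xR++9kvnUByfRKZfI7LJT8")
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()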

a13e
    Try Spark SQL first before reverting to Structured Streaming. That gives you a simpler environment to work with. Have you seen https://stackoverflow.com/q/52757599/1305344? – Jacek Laskowski Dec 04 '19 at 09:03

2 Answers


This stuff is probably easier to set up in spark-defaults.conf; see the sketch after the list below.

  1. Try using an all-lowercase hostname.
  2. Remove the :443 from the endpoint; HTTPS is the default, and there is a switch (fs.s3a.connection.ssl.enabled) to explicitly disable it.
  3. The secret key property is "fs.s3a.secret.key".
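
Putting those together, a sketch of the relevant spark-defaults.conf entries, reusing the question's placeholder credentials with a lowercased hostname and no port:

spark.hadoop.fs.s3a.endpoint              s3-region1.myobjectstore.com
spark.hadoop.fs.s3a.access.key            00cce9eb2c589b1b1b5b
spark.hadoop.fs.s3a.secret.key            flmheKX9Gb1tTlImO6xR++9kvnUByfRKZfI7LJT8
spark.hadoop.fs.s3a.path.style.access     true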
stevel

I solved this issue by setting the hadoop-aws jar version to 2.8.0 in my build.sbt. It seems the separate fs.s3a.path.style.access flag was only introduced in Hadoop 2.8.0, as I found the JIRA ticket HADOOP-12963 for this issue. After upgrading, it worked.
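For reference, a minimal sketch of the dependency change, assuming the default Maven Central resolver:

// build.sbt
// hadoop-aws 2.8.0 is the first release with fs.s3a.path.style.access (HADOOP-12963);
// it pulls in a matching aws-java-sdk transitively.
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.8.0"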

a13e