I'm trying to stream data from Kafka (JSON messages) and write it to AWS S3 using Apache Spark's (2.4.0) Structured Streaming API.
But I get an exception from the AWS library, without much detail.
I've tried writing to the local filesystem and to HDFS, and both work properly.
For S3, I'm able to list files via the HDFS CLI using hdfs dfs -ls s3://<bucket-name>/test/
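Note that the CLI check above goes through the s3:// connector, whereas the Spark job below uses s3a:// (S3AFileSystem), which is a different code path. To rule the connector itself in or out, a minimal sketch like the following (assuming hadoop-aws and the AWS SDK are on the classpath) lists the same prefix through S3AFileSystem directly:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
conf.set("fs.s3a.access.key", "xxxx")
conf.set("fs.s3a.secret.key", "xxxx")
conf.set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")

// list the prefix through the same S3A connector the streaming job uses
val fs = FileSystem.get(new URI("s3a://<bucket-name>/"), conf)
fs.listStatus(new Path("/test/")).foreach(status => println(status.getPath))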
A snippet of what I'm trying to do:
import org.apache.spark.sql.SparkSession

// S3A connector configuration: credentials plus the regional endpoint (ap-south-1)
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.access.key", "xxxx")
  .config("spark.hadoop.fs.s3a.secret.key", "xxxx")
  .config("spark.hadoop.fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")
  .getOrCreate()

// read the JSON messages from the Kafka topic as a streaming DataFrame
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .load()
// SOME ETL THEN
// write the stream as JSON files to S3, checkpointing to the same prefix
df.writeStream
  .outputMode("append")
  .option("checkpointLocation", "s3a://<bucket-name>/test/")
  .format("json")
  .option("path", "s3a://<bucket-name>/test/")
  .start()
  .awaitTermination()
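For comparison, this is the shape of the variant that works against the local filesystem (the paths here are just illustrative):

// identical sink, but pointed at local paths - this variant runs fine
df.writeStream
  .outputMode("append")
  .option("checkpointLocation", "/tmp/stream-checkpoint") // illustrative local path
  .format("json")
  .option("path", "/tmp/stream-output") // illustrative local path
  .start()
  .awaitTermination()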
The exception I get:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Error Code: null, AWS Error Message: Bad Request
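Since the response comes back with AWS Error Code: null, there is not much to go on. One way to get more detail (a sketch, assuming the log4j 1.x logging that Spark 2.4 ships with) is to raise the AWS SDK's request logger to DEBUG before starting the query, which logs the raw request/response exchange with S3:

import org.apache.log4j.{Level, Logger}

// log each S3 request and the full error response the SDK receives
Logger.getLogger("com.amazonaws.request").setLevel(Level.DEBUG)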