
I have tried the suggestions given in Apache Spark (Structured Streaming): S3 Checkpoint support, but I am still facing this issue. Below is the error I get:

17/07/06 17:04:56 WARN FileSystem: "s3n" is a deprecated filesystem 
name. Use "hdfs://s3n/" instead.
Exception in thread "main" java.lang.IllegalArgumentException: 
java.net.UnknownHostException: s3n

I have something like this as part of my code:

SparkSession spark = SparkSession
    .builder()
    .master("local[*]")
    .config("spark.hadoop.fs.defaultFS","s3")
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    .config("spark.hadoop.fs.s3n.awsAccessKeyId","<my-key>")
    .config("spark.hadoop.fs.s3n.awsSecretAccessKey","<my-secret-key>")
    .appName("My Spark App")
    .getOrCreate();

and the checkpoint directory is used like this:

StreamingQuery line = topicValue.writeStream()
   .option("checkpointLocation","s3n://<my-bucket>/checkpointLocation/")

Any help is appreciated. Thanks in advance!

fledgling
  • Try `config("spark.hadoop.fs.defaultFS","s3n")` and `.config("spark.hadoop.fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")`. Although I definitely don't recommend using `S3` as a distributed file system for Spark, since it has eventual consistency on reads. – Yuval Itzchakov Jul 07 '17 at 14:46
  • when do I use s3a and when s3n? – fledgling Jul 07 '17 at 14:52
  • I think `s3a` is the newer of the two. But I meant you generally don't want to use S3 at all. – Yuval Itzchakov Jul 07 '17 at 14:54
  • That didn't work either. – fledgling Jul 07 '17 at 15:43
  • That deprecation message is an odd one. It's telling you off for using a filesystem reference like "localhost:8080" as the name of the (HDFS) instance, when it now expects a full URI like "hdfs://localhost:8080/". If it is saying that for any other filesystem (here, s3), then it has got confused. – stevel Jul 10 '17 at 09:33
  • In your comment on this post you mentioned **checkpoint to S3, but have a long gap between checkpoints so that the time to checkpoint doesn't bring your streaming app down**. My question here is: right now, is checkpointing even possible in S3 with Structured Streaming? – fledgling Jul 11 '17 at 14:13
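The deprecation warning stevel describes comes from Hadoop treating a bare `fs.defaultFS` value as a host name rather than a URI. A minimal sketch with plain `java.net.URI` (no Spark or Hadoop needed) illustrates why a value like `"s3n"` with no scheme ends up being resolved as a host, which then fails with `UnknownHostException`:

```java
import java.net.URI;

public class SchemeCheck {
    public static void main(String[] args) {
        // A bare "s3n" parses with no scheme; Hadoop then falls back to
        // treating the value as a host name, which later fails DNS lookup.
        URI bare = URI.create("s3n");
        System.out.println(bare.getScheme()); // null

        // A full URI carries an explicit scheme, so Hadoop can dispatch
        // to the FileSystem implementation registered for that scheme.
        URI full = URI.create("s3n://my-bucket/checkpointLocation/");
        System.out.println(full.getScheme()); // s3n
        System.out.println(full.getHost());   // my-bucket
    }
}
```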

1 Answer


For S3 checkpointing support in Structured Streaming, you can try the following approach:

SparkSession spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("My Spark App")
    .getOrCreate();

spark.sparkContext().hadoopConfiguration().set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
spark.sparkContext().hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<my-key>");
spark.sparkContext().hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<my-secret-key>");

and the checkpoint directory can then be set like this:

StreamingQuery line = topicValue.writeStream()
   .option("checkpointLocation", "s3n://<my-bucket>/checkpointLocation/")
   .start();
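As noted in the comments, `s3a` is the newer connector and supersedes `s3n` in recent Hadoop releases. A sketch of the equivalent configuration using `s3a` might look like the following, assuming the `hadoop-aws` JAR and its matching AWS SDK dependency are on the classpath (the bucket and key values are placeholders):

```java
SparkSession spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("My Spark App")
    .getOrCreate();

// s3a reads these property names instead of the fs.s3n.* ones
spark.sparkContext().hadoopConfiguration().set("fs.s3a.access.key", "<my-key>");
spark.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", "<my-secret-key>");

StreamingQuery line = topicValue.writeStream()
    .option("checkpointLocation", "s3a://<my-bucket>/checkpointLocation/")
    .start();
```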

I hope this helps!

himanshuIIITian