
I have tried the suggestions given in Apache Spark (Structured Streaming): S3 Checkpoint support, but I am still facing this issue. Below is the error I get:

17/07/06 17:04:56 WARN FileSystem: "s3n" is a deprecated filesystem 
name. Use "hdfs://s3n/" instead.
Exception in thread "main" java.lang.IllegalArgumentException: 
java.net.UnknownHostException: s3n

I have something like this as part of my code:

SparkSession spark = SparkSession
    .builder()
    .master("local[*]")
    .config("spark.hadoop.fs.defaultFS","s3")
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    .config("spark.hadoop.fs.s3n.awsAccessKeyId","<my-key>")
    .config("spark.hadoop.fs.s3n.awsSecretAccessKey","<my-secret-key>")
    .appName("My Spark App")
    .getOrCreate();

and the checkpoint directory is used like this:

StreamingQuery line = topicValue.writeStream()
   .option("checkpointLocation","s3n://<my-bucket>/checkpointLocation/")

Any help is appreciated. Thanks in advance!

fledgling
  • Try `config("spark.hadoop.fs.defaultFS","s3n")` and `.config("spark.hadoop.fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")`. Although I definitely don't recommend using `S3` as a distributed file system for Spark, since it has eventual consistency on reads. – Yuval Itzchakov Jul 07 '17 at 14:46
  • when do I use s3a and when s3n? – fledgling Jul 07 '17 at 14:52
  • I think `s3a` is the newer of the two. But I meant you generally don't want to use S3 at all. – Yuval Itzchakov Jul 07 '17 at 14:54
  • That didn't work either. – fledgling Jul 07 '17 at 15:43
  • That deprecation message is an odd one. It's telling you off for using a filesystem reference like "localhost:8080" as the name of the (HDFS) instance, when it now expects a full URI like "hdfs://localhost:8080/". If it is saying that for any other filesystem (here, s3), then it has got confused. – stevel Jul 10 '17 at 09:33
  • In your comment on this post you mentioned **checkpoint to S3, but have a long gap between checkpoints so that the time to checkpoint doesn't bring your streaming app down**. My question here is: right now, is checkpointing even possible in S3 with Structured Streaming? – fledgling Jul 11 '17 at 14:13
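The deprecation warning stevel describes comes from Hadoop treating a bare `fs.defaultFS` value as a host name rather than a URI. A minimal sketch with plain `java.net.URI` (no Spark or Hadoop needed) illustrates why a value like `"s3n"` with no scheme ends up being resolved as a host, which then fails with `UnknownHostException`:

```java
import java.net.URI;

public class SchemeCheck {
    public static void main(String[] args) {
        // A bare "s3n" parses with no scheme; Hadoop then falls back to
        // treating the value as a host name, which later fails DNS lookup.
        URI bare = URI.create("s3n");
        System.out.println(bare.getScheme()); // null

        // A full URI carries an explicit scheme, so Hadoop can dispatch
        // to the FileSystem implementation registered for that scheme.
        URI full = URI.create("s3n://my-bucket/checkpointLocation/");
        System.out.println(full.getScheme()); // s3n
        System.out.println(full.getHost());   // my-bucket
    }
}
```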

1 Answer


For S3 checkpointing support in Structured Streaming, you can try the following approach:

SparkSession spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("My Spark App")
    .getOrCreate();

spark.sparkContext().hadoopConfiguration().set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
spark.sparkContext().hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<my-key>");
spark.sparkContext().hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<my-secret-key>");

and the checkpoint directory can then be set like this:

StreamingQuery line = topicValue.writeStream()
   .option("checkpointLocation", "s3n://<my-bucket>/checkpointLocation/")
   .start();
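As noted in the comments, `s3a` is the newer connector and supersedes `s3n` in recent Hadoop releases. A sketch of the equivalent configuration using `s3a` might look like the following, assuming the `hadoop-aws` JAR and its matching AWS SDK dependency are on the classpath (the bucket and key values are placeholders):

```java
SparkSession spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("My Spark App")
    .getOrCreate();

// s3a reads these property names instead of the fs.s3n.* ones
spark.sparkContext().hadoopConfiguration().set("fs.s3a.access.key", "<my-key>");
spark.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", "<my-secret-key>");

StreamingQuery line = topicValue.writeStream()
    .option("checkpointLocation", "s3a://<my-bucket>/checkpointLocation/")
    .start();
```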

I hope this helps!

himanshuIIITian