
I am running into a weird issue with Spark 2.0, using the SparkSession to load a text file. Currently my Spark config looks like this:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Kryo serialization, registering the Hadoop writable classes used for text input
    val sparkConf = new SparkConf().setAppName("name-here")
    sparkConf.registerKryoClasses(Array(Class.forName("org.apache.hadoop.io.LongWritable"), Class.forName("org.apache.hadoop.io.Text")))
    sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val spark = SparkSession.builder()
        .config(sparkConf)
        .getOrCreate()

    // s3a filesystem, server-side encryption, and the v2 file output committer
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.enableServerSideEncryption", "true")
    spark.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

If I load an s3a file through an RDD, it works fine. However, if I try to use something like:

    val blah = SparkConfig.spark.read.text("s3a://bucket-name/*/*.txt")
        .select(input_file_name, col("value"))
        .drop("value")
        .distinct()
    val x = blah.collect()
    println(blah.head().get(0))
    println(x.size)

I get an exception that says: `java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:`

Do I need to add some additional s3a configuration for the SQLContext or SparkSession? I haven't found any question or online resource that specifies this. What is weird is that the job seems to run for 10 minutes, but then fails with this exception. Again, using the same bucket and everything, a regular RDD load has no issues.
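
For reference, the kind of RDD read that works against the same bucket is roughly this (a sketch; the path is a placeholder):

    // Plain RDD read of the same bucket -- this completes without errors
    val lines = spark.sparkContext.textFile("s3a://bucket-name/*/*.txt")
    println(lines.count())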

The other weird thing is that it is complaining about s3 and not s3a. I have triple-checked my prefix, and it always says s3a.

Edit: I checked both s3a and s3; both throw the same exception.

17/04/06 21:29:14 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.hadoop.fs.Path.<init>(Path.java:93)
at org.apache.hadoop.fs.Globber.glob(Globber.java:240)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1732)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:237)
at org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:243)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:374)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:486)
at com.omitted.omitted.jobs.Omitted$.doThings(Omitted.scala:18)
at com.omitted.omitted.jobs.Omitted$.main(Omitted.scala:93)
at com.omitted.omitted.jobs.Omitted.main(Omitted.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.failExpecting(URI.java:2854)
at java.net.URI$Parser.parse(URI.java:3057)
at java.net.URI.<init>(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:202)
... 26 more
17/04/06 21:29:14 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:)
Derek_M

1 Answer


This should work.

  • Get the right JARs on your classpath: Spark built with Hadoop 2.7, the matching hadoop-aws JAR, aws-java-sdk-1.7.4.jar (exactly this version), and joda-time-2.9.3.jar (or a later version); a dependency sketch follows this list.
  • You shouldn't need to set the fs.s3a.impl value, as that's done in the Hadoop default settings. If you do find yourself doing that, it's a sign of a problem.
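
For example, a minimal build.sbt sketch with matching versions (the exact version numbers here are assumptions for a Spark 2.x / Hadoop 2.7.x build; align them with what your cluster actually ships):

    // Illustrative dependency set: hadoop-aws 2.7.x was built against
    // aws-java-sdk 1.7.4, so the SDK version must match exactly.
    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-sql"    % "2.0.2" % "provided",
      "org.apache.hadoop" %  "hadoop-aws"   % "2.7.3",
      "com.amazonaws"     %  "aws-java-sdk" % "1.7.4",
      "joda-time"         %  "joda-time"    % "2.9.3"
    )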

What's the full stack trace?

stevel
  • That is strange, but thanks. It was my aws-java-sdk version. I looked for a really long time but didn't find documentation on this. Is it listed somewhere for Spark? – Derek_M Apr 01 '17 at 13:12
  • Spark currently doesn't get its object-store dependencies right, because it doesn't pull the AWS artifacts in automatically; SPARK-7481 carries the patch for this. It would be great if you went to the long-standing pull request there and made clear why it matters to you. – stevel Apr 03 '17 at 09:35
  • Ran it with the updates today, and I still had problems. What is weird is that it fails after 8-10 minutes. I think the error message is a red herring though. – Derek_M Apr 06 '17 at 21:44
  • add the whole stack anyway: I'm curious now – stevel Apr 07 '17 at 13:54
  • Thanks. I added it above. – Derek_M Apr 07 '17 at 14:26
  • Also should add that this bucket has encryption turned on. – Derek_M Apr 07 '17 at 14:28
  • Ok. Something is calling `DataFrameReader.text("s3:")`; `s3:` is being rejected as an invalid URI. As to how that happens, I suspect it's somewhere in your part of the code. – stevel Apr 12 '17 at 18:18