s3 path printed incorrectly by spark excel reader

Question

I am trying to read an Excel sheet from Amazon S3 using the code snippet below. It fails saying the file doesn't exist, although it is there. Looking at the error, I noticed that a slash (/) is missing from the path.

println(path)
val data = sqlContext.read.
    format("com.crealytics.spark.excel").
    option("location", s3path).
    option("useHeader", "true").
    option("treatEmptyValuesAsNulls", "true").
    option("inferSchema","true").
    option("addColorColumns", "true").
    load(path)

The path is printed correctly as: s3a://AKIAJDDDDDDACNA:A6voquDDDDDqNOUsONDy@my-test/test.xlsx

But why is the slash missing when the path is read by Spark? Here is the error message:

 Name: java.io.FileNotFoundException
    Message: s3a:/AKIAJYDDDDDDNA:A6DDDDDDDDDwqxkRqUQyXqqNOUsONDy@my-test/test.xlsx (No such file or directory)
    StackTrace:   at java.io.FileInputStream.open0(Native Method)
      at java.io.FileInputStream.open(FileInputStream.java:212)
      at java.io.FileInputStream.<init>(FileInputStream.java:152)
      at java.io.FileInputStream.<init>(FileInputStream.java:104)
      at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:28)
      at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:31)
      at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:7)
      at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:345)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
      at $anonfun$1.apply(<console>:46)
      at $anonfun$1.apply(<console>:46)
      at time(<console>:36)

Can you obfuscate your AWS credentials and show us the path you are giving to the reader? — eliasah, Jun 08 '17 at 07:03

score 0 · Accepted Answer · answered Jun 08 '17 at 13:02

Somehow the s3a URL is getting passed down to java.io.FileInputStream.open(), which only works with local filesystem files, not HDFS, S3, etc. You will need to track down what is happening inside com.crealytics.spark.excel; the stack trace points at ExcelRelation.scala:28. Welcome to the world of using IDEs to work out what third-party libraries get up to :) (IntelliJ IDEA is very good at that, BTW, as it can go from a pasted stack trace to the specific source code.)
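
As for the missing slash itself: my guess, not something I have confirmed against the spark-excel source, is that it comes from java.io.File, which FileInputStream(String) uses internally. java.io.File treats the whole string as a local path and, on Unix-like systems, collapses the doubled slash. A minimal sketch of that effect, using a credential-free URL and a hypothetical object name LocalPathDemo:

```scala
import java.io.{File, FileInputStream, FileNotFoundException}

object LocalPathDemo {
  // java.io.File knows nothing about URL schemes: the whole string is
  // treated as a local path and normalized (on Unix-like systems the
  // duplicate separator in "s3a://" is collapsed to a single slash).
  def asLocalPath(url: String): String = new File(url).getPath

  // Try to open the URL the way a local-file-only reader would.
  // Returns the exception message if the open fails, None otherwise.
  def tryOpen(url: String): Option[String] =
    try {
      new FileInputStream(url).close()
      None
    } catch {
      case e: FileNotFoundException => Some(e.getMessage)
    }

  def main(args: Array[String]): Unit = {
    val url = "s3a://my-test/test.xlsx" // bucket and key taken from the question
    println(asLocalPath(url))
    println(tryOpen(url))
  }
}
```

So the path is not being mangled by Spark; it is simply never treated as an S3 URL in the first place.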

Also: don't put your secrets in your URLs. That's dangerous, and it is something which may get disabled in future for security reasons. Set spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key in your spark-defaults.conf instead.
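
For illustration, the relevant spark-defaults.conf entries would look roughly like this. The two values are placeholders, not real keys:

```
# spark-defaults.conf (placeholder values, replace with your own credentials)
spark.hadoop.fs.s3a.access.key  YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key  YOUR_SECRET_KEY
```

With the keys set there, the path handed to the reader no longer needs any credentials in it: s3a://my-test/test.xlsx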

score 0 · Answer 2 · answered Sep 03 '17 at 21:16

Alternatively, you can use the HadoopOffice library to read/write Excel files. It supports Spark data sources, but is also Hadoop native, so your S3 URL will probably work out of the box.

https://github.com/ZuInnoTe/hadoopoffice/wiki

Jörn Franke
