
I have hundreds of gzipped csv files in s3 that I am trying to load. The directory structure resembles the following:

bucket
-- level1
---- level2.1
-------- level3.1
------------ many files 
-------- level3.2
------------ many files 
---- level2.2
-------- level3.1
------------ many files 
-------- level3.2
------------ many files 

There may be several level2, level3 directories and many files under each. In the past I was loading the data using .textFile and passing the path using a wildcard like:

s3a://bucketname/level1/**

which worked fine to load all the files under all child paths. I am now trying to use the CSV loading mechanism in Spark 2 and I keep getting the following error:

java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
at org.apache.hadoop.fs.Path.<init>(Path.java:134)
at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:245)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:377)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$30.apply(SparkContext.scala:1014)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$30.apply(SparkContext.scala:1014)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:179)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:179)
at scala.Option.foreach(Option.scala:257)

I have tried using the following paths:

  1. s3a://bucketname/level1/**
  2. s3a://bucketname/level1/
  3. s3a://bucketname/level1

All result in the same error. If I use s3a://bucketname/level1/level2.1/level3.1/, it loads all the files under that one directory, but if I try to use a higher-level directory it fails.

My code to load is:

   Dataset<Row> csv = sparkSession.read()
            .option("delimiter", parseSettings.getDelimiter().toString())
            .option("quote", parseSettings.getQuote())
            .csv(path);
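
For comparison, the old textFile-based load looked roughly like the following (simplified sketch; the variable names and the JavaSparkContext wrapping are placeholders, not the exact original code):

    // Simplified sketch of the old load: wildcard plain-text read through the
    // underlying SparkContext, wrapped as a JavaSparkContext
    // (org.apache.spark.api.java). Names are placeholders.
    JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
    JavaRDD<String> lines = jsc.textFile("s3a://bucketname/level1/**");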

I thought the CSV loading used Spark's normal file resolution strategy, but the behavior seems to be different from using textFile. Is there a way to load all the files with the CSV format?

Thanks,
Nathan

Nathan Case

1 Answer


Sounds suspiciously like a bug.

That means: search for the error message and stack trace on issues.apache.org. FWIW, it could be SPARK-15473. If it's already there and not yet fixed, add your stack trace to it; if it's not there, file something new.

First: isolate it from the S3 input; try to replicate it with file:// URLs. That'll help point the blame at the right piece of code.
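
For example, something along these lines (made-up local path, same reader options) would tell you whether the CSV path handling itself is at fault, independent of S3:

    // Hypothetical local repro: same reader, pointed at a local copy of the
    // directory tree. If this works, suspicion shifts to the s3a path handling.
    Dataset<Row> local = sparkSession.read()
            .option("delimiter", ",")
            .option("quote", "\"")
            .csv("file:///tmp/level1/");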

Also, workaround time: the Databricks CSV reader still works.
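
For instance, a sketch along these lines, assuming the com.databricks:spark-csv package is on your classpath (options and path here are placeholders):

    // Sketch: fall back to the external spark-csv data source by name.
    // Requires the com.databricks:spark-csv package on the classpath.
    Dataset<Row> csv = sparkSession.read()
            .format("com.databricks.spark.csv")
            .option("delimiter", ",")
            .option("quote", "\"")
            .load("s3a://bucketname/level1/**");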

stevel
  • Running the code locally against my local filesystem using file:// works correctly. I will take a look at the JIRA item. – Nathan Case Jan 30 '17 at 21:01