I have hundreds of gzipped CSV files in S3 that I am trying to load with Spark. The directory structure resembles the following:
bucket
-- level1
---- level2.1
-------- level3.1
------------ many files
-------- level3.2
------------ many files
---- level2.2
-------- level3.1
------------ many files
-------- level3.2
------------ many files
There may be several level2 and level3 directories, with many files under each. In the past I loaded the data using .textFile and passed the path with a wildcard like:
s3a://bucketname/level1/**
which worked fine and loaded all the files under all child paths. I am now trying to use the CSV loading mechanism in Spark 2, and I keep getting the following error:
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
at org.apache.hadoop.fs.Path.<init>(Path.java:134)
at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:245)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:377)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$30.apply(SparkContext.scala:1014)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$30.apply(SparkContext.scala:1014)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:179)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:179)
at scala.Option.foreach(Option.scala:257)
I have tried using the following paths:
- s3a://bucketname/level1/**
- s3a://bucketname/level1/
- s3a://bucketname/level1
All of them result in the same error. If I use s3a://bucketname/level1/level2.1/level3.1/, all the files under that one directory load correctly, but any higher-level directory fails.
My code to load is:
Dataset<Row> csv = sparkSession.read()
    .option("delimiter", parseSettings.getDelimiter().toString())
    .option("quote", parseSettings.getQuote())
    .csv(path);
I thought the CSV loading used Spark's normal file resolution strategy, but the behavior seems to be different from textFile. Is there a way to load all the files with the CSV format?
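For anyone reproducing this, one variant that could be tried is to spell out each directory level with a single-level wildcard instead of relying on `**`. The helper below is only a minimal sketch: `GlobPaths` and `explicitDepthGlob` are hypothetical names, not part of Spark or Hadoop, and whether the CSV reader resolves the resulting glob without the exception above is an assumption that has not been tested here.

```java
// Hypothetical helper for illustration only; not a Spark or Hadoop API.
final class GlobPaths {
    private GlobPaths() {}

    // Builds a glob with one single-level wildcard per directory level below `base`.
    static String explicitDepthGlob(String base, int levels) {
        if (levels < 0) {
            throw new IllegalArgumentException("levels must be >= 0");
        }
        StringBuilder glob = new StringBuilder(base);
        // Drop trailing slashes so the result never contains "//" after the base.
        while (glob.length() > 0 && glob.charAt(glob.length() - 1) == '/') {
            glob.setLength(glob.length() - 1);
        }
        for (int i = 0; i < levels; i++) {
            glob.append("/*");
        }
        return glob.toString();
    }

    public static void main(String[] args) {
        // level2.x, level3.x and the files themselves: three levels below level1.
        System.out.println(explicitDepthGlob("s3a://bucketname/level1/", 3));
        // prints s3a://bucketname/level1/*/*/*
    }
}
```

The resulting string would take the place of `path` in the `.csv(path)` call above. It only covers a tree where every file sits at the same depth, which matches the layout described here but may not hold in general.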
Thanks,
Nathan