I have a custom-delimited input file that I pass to newAPIHadoopFile to load it as an RDD[String]. The file resides under the project's resources directory. The following code works well when run from the Eclipse IDE.

  val path = this.getClass()
                 .getClassLoader()
                 .getResource(fileName)                   
                 .toURI().toString()
  val conf = new org.apache.hadoop.conf.Configuration() 
  conf.set("textinputformat.record.delimiter", recordDelimiter)

  return sc.newAPIHadoopFile(
      path,
      classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
      classOf[org.apache.hadoop.io.LongWritable], 
      classOf[org.apache.hadoop.io.Text], 
      conf)
    .map(_._2.toString) 

However, when I run it with spark-submit (using an uber jar) as follows:

   spark-submit /Users/anon/Documents/myUber.jar

I get the following error:

 Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/Users/anon/Documents/myUber.jar!/myhome-data.json

Any suggestions, please?

user1384205

1 Answer

Since sc.newAPIHadoopFile requires a path rather than an input stream, I'd recommend using the --files option of spark-submit:

--files FILES Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName).

See the SparkFiles.get method:

Get the absolute path of a file added through SparkContext.addFile().

With that, you should use spark-submit as follows:

spark-submit --files fileNameHere /Users/anon/Documents/myUber.jar
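As a rough sketch of how the question's code might change once the file is shipped with --files: the path comes from SparkFiles.get instead of the class loader. The names ShippedFileReader and readDelimited are hypothetical, and the sketch assumes the resolved path is readable wherever the tasks run (for example in local mode); on a cluster the local path can differ between the driver and the executors.

```scala
import java.io.File

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkContext, SparkFiles}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object ShippedFileReader {
  def readDelimited(sc: SparkContext, fileName: String, recordDelimiter: String): RDD[String] = {
    // Local path of the file that was passed via --files (or SparkContext.addFile).
    val localPath = SparkFiles.get(fileName)
    // Turn the local path into a file: URI that Hadoop can parse.
    // Assumption: this path is readable by the processes running the tasks.
    val path = new File(localPath).toURI.toString

    val conf = new Configuration()
    conf.set("textinputformat.record.delimiter", recordDelimiter)

    sc.newAPIHadoopFile(
        path,
        classOf[TextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        conf)
      .map(_._2.toString) // copy each Text value into an immutable String
  }
}
```

Note that fileName here is only the file's name (for example myhome-data.json), not the full path given to --files.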

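In the general case, if a file is inside a jar file, you should access it through an InputStream (not directly as a File). A jar:file:...!/... URL, like the one in your error message, is not a file-system path that Hadoop can open.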

The code could look as follows:

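val content = scala.io.Source.fromInputStream(
  // yourObject and yourFileNameHere are placeholders for your own class and resource name
  classOf[yourObject].getClassLoader.getResourceAsStream(yourFileNameHere)
).mkString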

See Scala's Source object and Java's ClassLoader.getResourceAsStream method.
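If the resource is small enough to fit in driver memory, one possible way to keep it inside the jar is to read it through a stream and split the records yourself. The sketch below is only an illustration: ResourceRecords and readRecords are hypothetical names, UTF-8 encoding is an assumption, and empty records are dropped.

```scala
import java.io.InputStream
import java.util.regex.Pattern

import scala.io.Source

// Hypothetical helper object, not part of Scala or Spark.
object ResourceRecords {
  // Reads the whole stream into memory and splits it on the literal delimiter.
  def readRecords(in: InputStream, recordDelimiter: String): Seq[String] = {
    val source = Source.fromInputStream(in, "UTF-8") // assumption: the file is UTF-8
    try {
      source.mkString
        .split(Pattern.quote(recordDelimiter)) // quote so the delimiter is not treated as a regex
        .toSeq
        .filter(_.nonEmpty) // drop empty records, e.g. from a leading delimiter
    } finally {
      source.close()
    }
  }
}
```

On the driver, the result could then be turned into an RDD[String] with something like sc.parallelize(ResourceRecords.readRecords(getClass.getClassLoader.getResourceAsStream(fileName), recordDelimiter)). Unlike newAPIHadoopFile, this loads the whole file in the driver first, so it only suits small inputs.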

Jacek Laskowski