
I'm pretty new to functional programming and don't have an imperative programming background. I'm running through some basic Scala/Spark tutorials online and having some difficulty submitting a Scala application through spark-submit.

In particular I'm getting a java.lang.ArrayIndexOutOfBoundsException: 0, which I have researched and found means the array element at position 0 is the culprit. Looking into it further, I saw that some basic debugging could tell me whether the main application was actually picking up the argument at runtime - which it was not. Here is the code:

import org.apache.spark.{SparkConf, SparkContext}

object SparkMeApp {
  def main(args: Array[String]) {

    try {
      //program works fine if path to file is hardcoded
      //val logfile = "C:\\Users\\garveyj\\Desktop\\NetSetup.log"
      val logfile = args(0)
      val conf = new SparkConf().setAppName("SparkMe Application").setMaster("local[*]")
      val sc = new SparkContext(conf)
      val logdata = sc.textFile(logfile, 2).cache()
      val numFound = logdata.filter(line => line.contains("found")).count()
      val numData = logdata.filter(line => line.contains("data")).count()
      println("")
      println("Lines with found: %s, Lines with data: %s".format(numFound, numData))
      println("")
    }
    catch {
      case aoub: ArrayIndexOutOfBoundsException => println(args.length)
    }
  }
}

To submit the application using spark-submit I use:

spark-submit --class SparkMeApp --master "local[*]" --jars target\scala-2.10\firstsparkapplication_2.10-1.0.jar NetSetup.log

...where NetSetup.log is in the same directory I'm submitting the application from. The output of the application is simply: 0. If I remove the try/catch, the output is:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
        at SparkMeApp$.main(SparkMeApp.scala:12)
        at SparkMeApp.main(SparkMeApp.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

It's worth pointing out that the application runs fine if I remove the argument and hard-code the path to the log file. I don't really know what I'm missing here. Any direction would be appreciated. Thanks in advance!

Jonathan Garvey

3 Answers


You are calling spark-submit wrong. The correct command is

./spark-submit --class SparkMeApp --master "local[*]" \
example.jar examplefile.txt

You only need to pass --jars if there is an external dependency and you want to distribute that jar to all executors.
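Applied to the jar and input file from the question (paths taken verbatim from there), the submit command would look roughly like:

spark-submit --class SparkMeApp --master "local[*]" \
target\scala-2.10\firstsparkapplication_2.10-1.0.jar NetSetup.log

With --jars gone, the first non-option argument is treated as the application jar and everything after it ends up in args.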

If you had enabled log4j.properties with the log level set to INFO/WARN, you could have easily caught it:

Warning: Local jar /home/user/Downloads/spark-1.4.0/bin/NetSetup.log does not exist, skipping.
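For reference, a minimal sketch of enabling this, assuming a standard Spark download where conf/log4j.properties.template ships with the distribution:

# from the Spark home directory
cp conf/log4j.properties.template conf/log4j.properties
# then in conf/log4j.properties, set the root logger level, e.g.:
# log4j.rootCategory=WARN, console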
Knight71
  • Thanks for the tip. In the end I removed '--jars' from the command and it worked a treat. – Jonathan Garvey Aug 01 '16 at 19:39
  • Out of curiosity - as I'm still new to all of this - how would one enable the log4j.properties to INFO/WARN? I see there are packages for log4j to do this programmatically - though is there an easier way? – Jonathan Garvey Aug 01 '16 at 19:45

The text file should be in HDFS (if you are using Hadoop) or in whatever other DFS backs your Spark cluster if you want to pass relative paths for the application to read. So either put the file into the DFS for your application to work, or give the absolute path on your OS file system.

Look here for instructions on how to add files to HDFS, and at this related discussion, which might help you.
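For example, putting the file into HDFS with the standard hdfs CLI would look roughly like this (a sketch; the /user/<your-user> destination is just a placeholder):

hdfs dfs -mkdir -p /user/<your-user>
hdfs dfs -put NetSetup.log /user/<your-user>/NetSetup.log

The application could then read it with sc.textFile("hdfs:///user/<your-user>/NetSetup.log").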

Also, you are setting the master to be used by the application twice: in the Spark conf (setMaster("local[*]")):

val conf = new SparkConf().setAppName("SparkMe Application").setMaster("local[*]")

and in the submit command (--master "local[*]"):

spark-submit --class SparkMeApp --master "local[*]" --jars target\scala-2.10\firstsparkapplication_2.10-1.0.jar NetSetup.log

You only need to set it once; choose one of them.
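For example, keeping the master on the command line only (a sketch based on the code in the question) would mean dropping setMaster from the conf:

val conf = new SparkConf().setAppName("SparkMe Application") // master supplied via spark-submit's --master flag

and submitting with:

spark-submit --class SparkMeApp --master "local[*]" target\scala-2.10\firstsparkapplication_2.10-1.0.jar NetSetup.log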

andriosr
  • I think the --jars flag expects one or more .jar files or directories separated by commas, and after that the spark-submit script expects the jar of the application, which in this case it thinks is NetSetup.log. So you should remove the "--jars" flag. – Marco Aug 01 '16 at 13:30
  • spark-submit works fine with a single jar being passed to the --jars parameter, and it expects the arguments for it right after. So the spark-submit is OK, except for the duplicated master setup I've mentioned. – andriosr Aug 01 '16 at 13:40
  • The --jars flag is used to add extra jars to be transferred to the cluster along with the app jar. The problem in this case is that "target\scala-2.10\firstsparkapplication_2.10-1.0.jar" is being taken as an extra jar and "NetSetup.log" as the application jar, so no argument reaches the app. – Marco Aug 01 '16 at 13:54
  • You guys are right; looking at some of my code I can see that --jars is actually not necessary with a single jar. Sorry for the misunderstanding. Removing --jars from the spark-submit statement should fix it, as stated in Knight's answer. – andriosr Aug 01 '16 at 14:11
  • On the ball, guys - removing '--jars' from the command worked fine; it appears to be useful only when there are several jars, as @Marco pointed out. Keeping '--jars' in there made spark-submit think that my argument at the end was another jar. It would be nice to see some more documentation on this, as it's not quite obvious from the existing spark-submit docs. I also removed the duplicate setMaster to clean things up. Thanks a mil, everyone. – Jonathan Garvey Aug 01 '16 at 19:49

--Problem solved-- I was making incorrect use of the spark-submit command. By removing '--jars' from the command, the Scala application argument was picked up by spark-submit.

Jonathan Garvey