
From the Spark downloads page, if I download the tar file for v2.0.1, I see that it contains some jars that I find useful to include in my app.

If I download the tar file for v1.6.2 instead, I don't find the jars folder in there. Is there an alternate package type I should use from that site? I am currently choosing the default (pre-built for Hadoop 2.6). Alternatively, where can I find those Spark jars? Should I get each of them individually from http://spark-packages.org?

Here is an indicative bunch of jars I want to use:

  • hadoop-common
  • spark-core
  • spark-csv
  • spark-sql
  • univocity-parsers
  • spark-catalyst
  • json4s-core
– sudheeshix

1 Answer


The way Spark ships its runtime has changed from V1 to V2.

  • In V2, by default, you have multiple JARs under $SPARK_HOME/jars
  • In V1, by default, there was just one massive spark-assembly*.jar under $SPARK_HOME/lib that contained all the dependencies.

I believe you can change the default behavior, but that would require recompiling Spark on your own...

Also, about spark-csv specifically:

  • In V2, the CSV file format is natively supported by SparkSQL
  • In V1, you have to download spark-csv (for Scala 2.10) from Spark-Packages.org plus commons-csv from Commons.Apache.org and add both JARs to your CLASSPATH
    (with --jars on the command line, or with the property spark.driver.extraClassPath plus a call to sc.addJar() if the command line does not work for some reason)
    ...and the syntax is more cumbersome, too (see the sketch right after this list)
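
For illustration, a minimal sketch of both syntaxes in Scala; the file paths, app names, and JAR file names are made-up examples, and the V1 snippet assumes the two extra JARs were not already passed with --jars:

  // Spark 1.6.x - spark-csv is an external package, so the format has to be spelled out
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("csv-v1"))
  // ship the extra JARs to the executors at runtime (the driver side still needs them
  // on its own classpath, e.g. via spark.driver.extraClassPath)
  sc.addJar("/path/to/spark-csv_2.10-1.5.0.jar")
  sc.addJar("/path/to/commons-csv-1.4.jar")

  val sqlContext = new SQLContext(sc)
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load("/path/to/data.csv")

And the equivalent read with Spark 2.x, where no extra package is needed:

  // Spark 2.x - CSV support is built into Spark SQL
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("csv-v2").getOrCreate()
  val df = spark.read
    .option("header", "true")
    .csv("/path/to/data.csv")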


Excerpt from the vanilla $SPARK_HOME/bin/spark-class as of Spark 2.1.x (greatly simplified)

  # Find Spark jars
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"

And as of Spark 1.6.x

  # Find assembly jar
  ASSEMBLY_DIR="${SPARK_HOME}/lib"
  ASSEMBLY_JARS="$(ls -1 "$ASSEMBLY_DIR" | grep "^spark-assembly.*hadoop.*\.jar$" || true)"
  SPARK_ASSEMBLY_JAR="${ASSEMBLY_DIR}/${ASSEMBLY_JARS}"
  LAUNCH_CLASSPATH="$SPARK_ASSEMBLY_JAR"
– Samson Scharfrichter
  • In Spark 2.2.0, dropping the jars in `$SPARK_HOME/jars` seems to make them available to `spark-shell` and `pyspark` via the terminal; however, when I submit a Spark app those jars are not picked up and I get a ClassNotFound exception. Do I need to specify anything else when building my Spark context? – perrohunter Jan 09 '18 at 20:09
  • What do you mean exactly by _"submit a spark app"_: does that imply the `spark-submit` shell script, which invokes `spark-class` among many other things? Otherwise, you'll have to reverse-engineer the whole mess - good luck with that. – Samson Scharfrichter Jan 09 '18 at 21:54
  • In case someone gets here from Google searching for the `jars` folder in AWS EMR: in Spark 2.x it is in `/usr/lib/spark/jars/`. Check [this tutorial](https://aws.amazon.com/pt/premiumsupport/knowledge-center/emr-permanently-install-library/) from AWS to see more info. – Daniel Lavedonio de Lima May 15 '21 at 07:32