
I am building my first Spark application, developing with IDEA.

In my cluster, the version of Spark is 2.1.0, and the version of Scala is 2.11.8.

http://spark.apache.org/downloads.html tells me: "Starting version 2.0, Spark is built with Scala 2.11 by default. Scala 2.10 users should download the Spark source package and build with Scala 2.10 support".

So here is my question: what is the meaning of "Scala 2.10 users should download the Spark source package and build with Scala 2.10 support"? Why not just use Scala 2.11?

Another question: which version of Scala can I choose?

Underwood
  • Welcome to StackOverflow. We’d love to help you. To improve your chances of getting an answer, here are some tips: https://stackoverflow.com/help/how-to-ask – Dumi Dec 19 '18 at 09:01
  • Spark version 2.0 is not built with Scala 2.10 anymore. If you want to use Scala 2.10 (which in your case is not necessary), you have to download the source package and build it yourself. See this post if it can help: https://stackoverflow.com/questions/39282434/how-to-build-spark-from-the-sources-from-the-download-spark-page – Marouane Lakhal Dec 19 '18 at 09:16

1 Answer


First a word about the "why".

The reason this subject even exists is that Scala versions are not (generally speaking) binary compatible, although most of the time the source code is compatible.

So you can take Scala 2.10 source code and compile it against 2.11.x or 2.10.x. But binaries (JARs) compiled for 2.10.x cannot be run in a 2.11.x environment.
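In practice you can see this in how Scala libraries are published: the Scala binary version is part of the artifact name (spark-core_2.10 and spark-core_2.11 are two different JARs), and sbt's %% operator appends that suffix for you based on your scalaVersion. As a minimal illustration (the version numbers here are examples only):

// build.sbt fragment -- illustrative versions only.
// "%%" appends the Scala binary suffix derived from scalaVersion,
// so with 2.11.8 this resolves to the spark-core_2.11 artifact;
// a plain "%" would force you to hard-code the suffix by hand.
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"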

You can read more on the subject.

Spark Distributions

So, the Spark package, as you mention, is built for Scala 2.11.x runtimes.

That means you cannot run a Scala 2.10.x JAR of yours on a cluster / Spark instance that runs the spark.apache.org-built distribution of Spark.

What would work is:

  • You compile your JAR for Scala 2.11.x and keep the same Spark
  • You recompile Spark for Scala 2.10 and keep your JAR as is

What are your options

Compiling your own JAR for Scala 2.11 instead of 2.10 is usually far easier than recompiling Spark itself (lots of dependencies to get right).

Usually, your Scala code is built with sbt, and sbt can target a specific Scala version (see, for example, this thread on SO). It is a matter of specifying:

scalaVersion in ThisBuild := "2.10.0"
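For the setup described in the question (a Spark 2.1.0 cluster built for Scala 2.11), a minimal build.sbt could look like the sketch below; the application name, the exact Scala patch version and the choice of Spark modules are assumptions to adapt to your own project:

// Minimal build.sbt sketch, assuming a Spark 2.1.0 / Scala 2.11 cluster.
name := "my-first-spark-app"
version := "0.1.0"

// Match the Scala binary version your Spark distribution was built with.
scalaVersion := "2.11.8"

// Spark modules are marked "provided": the cluster supplies them at runtime,
// so they should not be bundled into your application JAR.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided"
)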

You can also use sbt to "cross build", that is, build different JARs for different Scala versions.

crossScalaVersions := Seq("2.11.11", "2.12.2")
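As a sketch of how cross building is used in practice (the versions below are illustrative; an application depending on Spark 2.1.x would cross build 2.10/2.11, since that Spark line ships no Scala 2.12 artifacts), you declare the versions and then prefix sbt tasks with + so they run once per Scala version:

// build.sbt fragment -- illustrative versions.
crossScalaVersions := Seq("2.10.6", "2.11.8")

// Then, on the command line:
//   sbt +compile   // compiles once per declared Scala version
//   sbt +package   // produces one JAR per version, e.g. myapp_2.10-*.jar and myapp_2.11-*.jar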

How to choose a Scala version

Well, this is "sort of" opinion based. My recommendation would be: choose the Scala version that matches your production Spark cluster.

If your production Spark is 2.3 downloaded from https://spark.apache.org/downloads.html, then as they say, it uses Scala 2.11 and that is what you should use too. Using anything else, in my view, just leaves the door open for various incompatibilities down the road.
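If you are unsure which Scala version your cluster's Spark build uses, you can check it directly from spark-shell; the values in the comments below are just what you would expect for the cluster described in the question:

// Inside spark-shell on the cluster (the startup banner also prints
// a line like "Using Scala version 2.11.8 ..."):
util.Properties.versionNumberString   // e.g. "2.11.8" -> build your JAR for Scala 2.11
sc.version                            // e.g. "2.1.0"  -> the Spark version to depend on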

Stick with what your production needs.

GPI
  • Spark 2.0+ is based on the newer Scala 2.11. If you use Scala 2.10, some of the libraries newly added in 2.11 do not exist in 2.10. In that case, when Spark 2.0+ is compiled, will it fail to compile? – Underwood Dec 19 '18 at 10:57
  • You got it right: if you go 2.10 (by recompiling Spark), then every Scala dependency you "add to the mix" will also have to be 2.10. 2.11 code usually compiles fine in 2.10, but you have to recompile it (or find pre-compiled 2.10 binaries). – GPI Dec 19 '18 at 11:28