Java - Apache Spark communication

Question

I'm quite new to Spark and was looking for some guidance :-)

What's the typical way in which a Java MVC application communicates with Spark? To simplify things, let's say I want to count the words in a certain file whose name is provided via GET request to my server.

My initial approach was to open the context and implement the transformations/computations in a class inside my MVC application. That means that at runtime I would need an uber jar of spark-core. The problems are:

The uber jar weighs 80 MB
I am facing the same problem (akka.version) as in: apache spark: akka version error by build jar with all dependencies
I can have a go with shade to solve it, but I have the feeling this is not the way to go.

Maybe the "provided" scope in Maven would help me, but I'm using Ant.

Should my application - as suggested in the page - already have one jar with the implementation (devoid of any Spark libraries) and call spark-submit every time I receive a request? I guess it would leave the results somewhere.

Am I missing any middle-of-the-road approach?

score 2 · Accepted Answer · answered Jun 06 '15 at 01:09

Using spark-submit each time is kind of heavyweight; I'd recommend using a long-running Spark context of some sort. I think the "middle of the road" option that you might be looking for is to have your job use something like the IBM Spark Kernel, Zeppelin, or the Spark Job Server from Ooyala.
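
To make the "long-running context" idea concrete, here is a minimal Python sketch. It is not taken from any of the tools named above; `SharedContext`, `spark_factory` and `count_words` are hypothetical names, and it assumes pyspark is installed wherever `spark_factory` is actually called. The point is only that the context is created once and reused by every request, instead of paying the startup cost of spark-submit each time.

```python
import threading


class SharedContext:
    """Creates the context once and hands the same instance to every caller."""

    def __init__(self, factory):
        self._factory = factory          # callable that builds the context
        self._lock = threading.Lock()    # request handlers may run concurrently
        self._context = None

    def get(self):
        with self._lock:
            if self._context is None:
                self._context = self._factory()
            return self._context


def spark_factory():
    # Assumption: pyspark is available; imported lazily so this module
    # can be loaded on machines without Spark.
    from pyspark import SparkContext
    return SparkContext.getOrCreate()


def count_words(sc, path):
    """Word count for one file, run on an already-open context."""
    return (sc.textFile(path)
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b)
              .collect())
```

A request handler would then call something like `count_words(shared.get(), file_name)`, where `shared = SharedContext(spark_factory)` is built once at application startup. The services mentioned above do roughly this for you, plus job management, behind a network API.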

Holden

Thank you. I think we're going for Spring XD to integrate with Spark and other technologies. – Javier Moreno Garcia Jun 09 '15 at 21:19

score 2 · Answer 2 · answered Jun 08 '16 at 09:07

A good practice is to use a middleware service deployed on top of Spark that manages its contexts, job failures, Spark versions, and a lot of other things you would otherwise have to consider.

I would recommend Mist. It implements Spark as a Service and creates a unified API layer for building enterprise solutions and services on top of a Big Data lake.

Mist supports executing Scala and Python jobs.

The quick start is as follows:

Add the Mist wrapper to your Spark job:
Scala example:

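import org.apache.spark.SparkContext
// MistJob comes from the Mist library; its import path depends on the Mist version.

object SimpleContext extends MistJob {
    override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
        // Read the "digits" request parameter, double each value on the cluster, and return the result.
        val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
        val rdd = context.parallelize(numbers)
        Map("result" -> rdd.map(x => x * 2).collect())
    }
}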

Python example:

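import mist

class MyJob:
    def __init__(self, job):
        job.sendResult(self.doStuff(job))

    def doStuff(self, job):
        # The parameters are exposed as a Scala-style collection (head/tail/size),
        # so walk it element by element and copy it into a plain Python list.
        values = job.parameters.values()
        scala_list = values.head()
        size = scala_list.size()
        pylist = []
        count = 0
        while count < size:
            pylist.append(scala_list.head())
            count = count + 1
            scala_list = scala_list.tail()
        # Double every element on the cluster and collect the results.
        rdd = job.sc.parallelize(pylist)
        result = rdd.map(lambda s: 2 * s).collect()
        return result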

if __name__ == "__main__":
    job = MyJob(mist.Job())

Run the Mist service:

Build Mist

git clone https://github.com/hydrospheredata/mist.git
cd mist
./sbt/sbt -DsparkVersion=1.5.2 assembly # change the version to match your installed Spark

Create a configuration file

mist.spark.master = "local[*]"
mist.settings.threadNumber = 16

mist.http.on = true
mist.http.host = "0.0.0.0"
mist.http.port = 2003

mist.mqtt.on = false

mist.recovery.on = false

mist.contextDefaults.timeout = 100 days
mist.contextDefaults.disposable = false

mist.contextDefaults.sparkConf = {
    spark.default.parallelism = 128
    spark.driver.memory = "10g"
    spark.scheduler.mode = "FAIR"
}

Run

spark-submit    --class io.hydrosphere.mist.Mist \
                --driver-java-options "-Dconfig.file=/path/to/application.conf" \
                target/scala-2.10/mist-assembly-0.2.0.jar

Try it with curl from a terminal:

curl --header "Content-Type: application/json" -X POST http://192.168.10.33:2003/jobs --data '{"jarPath":"/vagrant/examples/target/scala-2.10/mist_examples_2.10-0.2.0.jar", "className":"SimpleContext$","parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]}, "external_id":"12345678","name":"foo"}'
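
If you would rather call the service from application code than from curl, the same request can be built with the Python standard library. This is only a sketch: `build_job_request` and `submit_job` are hypothetical helper names, the URL and payload fields are copied from the curl call above, and the response is returned as raw text because the response format is not shown here.

```python
import json
import urllib.request


def build_job_request(base_url, jar_path, class_name, parameters, external_id, name):
    """Build (but do not send) the POST request for the /jobs endpoint."""
    payload = {
        "jarPath": jar_path,
        "className": class_name,
        "parameters": parameters,
        "external_id": external_id,
        "name": name,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_job(request, timeout=60):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Same values as the curl example; nothing is sent until submit_job is called.
request = build_job_request(
    "http://192.168.10.33:2003",
    "/vagrant/examples/target/scala-2.10/mist_examples_2.10-0.2.0.jar",
    "SimpleContext$",
    {"digits": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]},
    "12345678",
    "foo",
)
```

Calling `submit_job(request)` needs the Mist service to be running and reachable at that address. A Java MVC application would do the same thing with whatever HTTP client it already uses.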
