
Given the following function objects,

val f : Int => Double = (i:Int) => i + 0.1

val g1 : Double => Double = (x:Double) => x*10

val g2 : Double => Double = (x:Double) => x/10

val h : (Double,Double) => Double = (x:Double,y:Double) => x+y

and, for instance, 3 remote servers or nodes (IP xxx.xxx.xxx.1, IP 2 and IP 3), how can the execution of this program be distributed,

val fx = f(1)
val g1x = g1( fx )
val g2x = g2( fx )
val res = h ( g1x, g2x )

so that

  • fx is computed in IP 1,
  • g1x is computed in IP 2,
  • g2x is computed in IP 3,
  • res is computed in IP 1

Can Scala Akka or Apache Spark provide a simple approach to this?

Update

  • Finagle, an RPC (Remote Procedure Call) library suggested by @pkinsky, may be a feasible choice.
  • Consider load-balancing policies as the mechanism for selecting a node for execution, at least an "any free/available node" policy.
elm
  • Question: why are you using function vals instead of defs? – Electric Coffee Aug 18 '14 at 18:36
  • @ElectricCoffee no special reason; the question may just as well be stated for methods, either is fine if it leads to a good solution :) – elm Aug 18 '14 at 18:37
  • Do you want to serialize your functions, send them to remote servers, have the remote servers execute them, serialize the results, and return them to you? Or do you just need an RPC library? If it's the second, check out Twitter's open-source Finagle library. – pkinsky Aug 18 '14 at 18:40
  • @pkinsky many thanks for the ideas; I'm new to this and unsure of the qualities of each option... – elm Aug 18 '14 at 18:43
  • @pkinsky after a quick check of RPC Finagle, it looks very promising... – elm Aug 18 '14 at 18:46
  • One of the great things about writing functional-style operations (map, fold, reduce) on collections in languages like Scala is that you can write the logic and then, often with almost no work, put any backend behind it (serial, parallel collections, Apache Spark). So my recommendation is to write your code like that and then evaluate the results of using various backends. – aaronman Aug 18 '14 at 18:54
  • @enzyme you almost certainly want the second option. Check out Twitter's intro to the topic & step-by-step distributed search engine project. http://twitter.github.io/scala_school/finagle.html http://twitter.github.io/scala_school/searchbird.html – pkinsky Aug 18 '14 at 23:19
  • @enzyme What mechanism should determine where the computation takes place? – EECOLOR Nov 01 '14 at 01:52
  • @EECOLOR (any) load-balancing policy, at least "free available node" policy. – elm Nov 01 '14 at 07:42
  • @enzyme If you have all the functions present at all of the nodes you could use Akka where the message contains the name of the method and the parameters. If you want to send functions, you might look into the Spores project which aims (in combination with Pickles) to safely serialize functions to be executed elsewhere. – EECOLOR Nov 03 '14 at 19:42
  • @EECOLOR many thanks, looking forward to a draft/example of this, looks highly promising :) how to pass a message with a function name and a (variable) number of arguments? – elm Nov 04 '14 at 08:35
  • @enzyme I probably used the wrong words. You can have `Method1(arg1, arg2)` and `Method2(arg1)` as messages. Then in your `receive` method you execute the correct method (see the sketch below these comments). – EECOLOR Nov 04 '14 at 19:38
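
A minimal sketch of the message-based approach EECOLOR describes, assuming classic (untyped) Akka actors (2.4+); the message names (ApplyF, ApplyG1, ApplyG2, ApplyH), the Worker actor and the "demo" system are hypothetical, and in a real deployment the worker would be created remotely via akka-remote, with a router implementing the load-balancing policy:

import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical messages: each one names the function to run plus its arguments.
case class ApplyF(i: Int)
case class ApplyG1(x: Double)
case class ApplyG2(x: Double)
case class ApplyH(x: Double, y: Double)

// A worker that has all the functions present locally and dispatches on the message type.
class Worker extends Actor {
  val f: Int => Double = i => i + 0.1
  val g1: Double => Double = x => x * 10
  val g2: Double => Double = x => x / 10
  val h: (Double, Double) => Double = (x, y) => x + y

  def receive = {
    case ApplyF(i)    => sender() ! f(i)
    case ApplyG1(x)   => sender() ! g1(x)
    case ApplyG2(x)   => sender() ! g2(x)
    case ApplyH(x, y) => sender() ! h(x, y)
  }
}

object Demo extends App {
  implicit val timeout: Timeout = Timeout(5.seconds)
  val system = ActorSystem("demo")
  // With akka-remote this actor could be deployed on IP 1, 2 and 3 instead of locally.
  val worker = system.actorOf(Props[Worker], "worker")
  val fx = Await.result((worker ? ApplyF(1)).mapTo[Double], 5.seconds)
  println(fx) // 1.1
  system.terminate()
}

The same Worker could run on each node; a caller on IP 1 would then ask the worker on IP 2 for ApplyG1(fx), the worker on IP 3 for ApplyG2(fx), and combine the results with ApplyH.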

1 Answer


I can speak for Apache Spark. It can do what you are looking for with the code below, but it's not designed for this kind of parallel computation. It is designed for parallel computation where you also have a large amount of data distributed across many machines. So the solution looks a bit silly; for example, we distribute a single integer to a single machine just to compute f(1).

Also, Spark is designed to run the same computation on all the data. So running g1() and g2() in parallel goes a bit against the design. (It's possible, but not elegant, as you see.)

// Distribute the input (1) across 1 machine.
val rdd1 = sc.parallelize(Seq(1), numSlices = 1)
// Run f() on the input, collect the results and take the first (and only) result.
val fx = rdd1.map(f(_)).collect.head
// The next stage's input will be (1, fx), (2, fx) distributed across 2 machines.
val rdd2 = sc.parallelize(Seq((1, fx), (2, fx)), numSlices = 2)
// Run g1() on one machine, g2() on the other.
val gxs = rdd2.map {
  case (1, x) => g1(x)
  case (2, x) => g2(x)
}.collect
val g1x = gxs(0)
val g2x = gxs(1)
// Same deal for h() as for f(). The input is (g1x, g2x), distributed to 1 machine.
val rdd3 = sc.parallelize(Seq((g1x, g2x)), numSlices = 1)
val res = rdd3.map { case (g1x, g2x) => h(g1x, g2x) }.collect.head

You can see that Spark code is based around the concept of RDDs. An RDD is like an array, except it's partitioned across multiple machines. sc.parallelize() creates such a parallel collection from a local collection. For example rdd2 in the above code will be created from the local collection Seq((1, fx), (2, fx)) and split across two machines. One machine will have Seq((1, fx)), the other will have Seq((2, fx)).
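
If you want to see that split for yourself, one option (a small sketch, assuming you run it in spark-shell where sc is predefined) is glom(), which groups each partition into an array:

// Each inner array is one partition, i.e. the data held by one machine.
rdd2.glom().collect()
// roughly: Array(Array((1,1.1)), Array((2,1.1))), since fx = f(1) = 1.1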

Next we do a transformation on the RDD. map is a common transformation that creates a new RDD of the same length by applying a function to each element. (Same as Scala's map.) The map we run on rdd2 will replace (1, x) with g1(x) and (2, x) with g2(x). So on one machine it will cause g1() to run, while on the other g2() will run.
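
For comparison, here is the same map on a plain local Seq (no cluster involved); the RDD version does the same thing, except the two elements live on two different machines:

// Locally, Scala's own map applies the same pattern-matching function to each element.
val localGxs = Seq((1, fx), (2, fx)).map {
  case (1, x) => g1(x)
  case (2, x) => g2(x)
}
// localGxs holds the same two values as gxs above, just computed on one machine.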

Transformations run lazily, only when you want to access the results. The methods that access the results are called actions. The most straightforward example is collect, which downloads the contents of the entire RDD from the cluster to the local machine. (It is exactly the opposite of sc.parallelize().)
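
A small way to observe the laziness (again a sketch for spark-shell; on a real cluster the println would end up in the executor logs rather than in your shell):

// Nothing runs here: map only records the transformation.
val lazyRdd = rdd1.map { i => println(s"computing f($i)"); f(i) }
// Only an action such as collect (or count, take, first) triggers the computation.
val forced = lazyRdd.collect()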

You can try and see all this if you download Spark, start bin/spark-shell, and copy your function definitions and the above code into the shell.

Daniel Darabos