
I'm a Spark Scala programmer. I have a Spark job made up of sub-tasks that all have to complete before the whole job is done. I wanted to use Future to run the sub-tasks in parallel. When the whole job completes, I have to return the whole job's response.

What I've heard about Scala's Future is that once the main thread finishes and exits, the remaining threads are killed and you also get an empty response.

So I have to use Await.result to collect the results. But all the blogs say you should avoid Await.result and that it's bad practice.

Is using Await.result the correct way of doing this in my case or not?

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

def computeParallel(): Future[String] = {
  val f1 = Future {  "ss" }
  val f2 = Future { "sss" }
  val f3 = Future { "ssss" }

  for {
    r1 <- f1
    r2 <- f2
    r3 <- f3
  } yield (r1 + r2 + r3)
} 

computeParallel().map(result => ???)
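What I had in mind is blocking at the very end of the program, roughly like this (just a sketch; the 10-second timeout is arbitrary):

import scala.concurrent.Await
import scala.concurrent.duration._

// Block the main thread until the combined future completes (or the timeout expires),
// so the program doesn't exit before the sub-tasks finish.
val result: String = Await.result(computeParallel(), 10.seconds)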



To my understanding, Future is meant for web-service kind of applications, where there is one process that is always running and never exits. But in my case, once the logic (a Scala program) finishes executing, the process exits.

Can I use Future for my problem or not?

Learnis

1 Answer


Using futures in Spark is probably not advisable except in special cases, and simply parallelizing computation isn't one of them. Giving blocking I/O (e.g. making requests to an outside service) a non-blocking wrapper is quite possibly the only special case.

Note that Future doesn't guarantee parallelism, just asynchrony: whether and how futures are executed in parallel depends on the ExecutionContext in which they're run. Also, if you're spawning computation-performing futures inside a Spark transformation (i.e. on the executor, not the driver), chances are that there won't be any performance improvement, since Spark tends to do a good job of keeping the executors' cores busy; all spawning those futures does is contend with Spark for those cores.
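For illustration, here's a minimal sketch (not Spark-specific) showing that parallelism comes from the ExecutionContext, not from Future itself:

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// With a single-threaded ExecutionContext the futures are still asynchronous,
// but they can't overlap: they run one after another on the lone thread.
implicit val singleThreaded: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

val f1 = Future { Thread.sleep(1000); "ss" }
val f2 = Future { Thread.sleep(1000); "sss" }
// Swapping in Executors.newFixedThreadPool(2) would let f1 and f2 actually run in parallel.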

Broadly, be very careful about combining parallelism abstractions like Spark RDDs/DStreams/Dataframes, actors, and futures: there are a lot of potential minefields where such combinations can violate guarantees and/or conventions in the various components.

It's also worth noting that Spark has requirements around serializability of intermediate values and that futures aren't generally serializable, so a Spark stage can't result in a future; this means that you basically have no choice but to Await on the futures spawned in a stage.
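For illustration, the bluntest version of that is to block on each future individually inside the transformation (a rough sketch; lookup here is a hypothetical async call, not anything from your code):

import org.apache.spark.rdd.RDD
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

// Hypothetical async call standing in for whatever spawns the future
def lookup(s: String): Future[String] = ???

// Blocking on each element's future keeps the stage's output as plain,
// serializable Strings rather than Futures
def lookupRDD(rdd: RDD[String]): RDD[String] =
  rdd.map(s => Await.result(lookup(s), Duration.Inf))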

If you still want to spawn futures in a Spark stage (e.g. posting the strings to a web service), it's probably best to use Future.sequence to collapse the futures into one and then Await on that (note that I have not tested this idea: Future.sequence also needs an implicit CanBuildFrom for the collection of futures, which is available for standard collections like Vector):

import org.apache.spark.rdd.RDD
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

def postString(s: String): Future[Unit] = ???

def postStringRDD(rdd: RDD[String]): RDD[String] = {
  rdd.mapPartitions { strings =>
    // since this is only used for combining the futures in the Await,
    // it's probably OK to use the global execution context here
    implicit val ectx: ExecutionContext = ExecutionContext.global
    // materialize the partition so the strings can be posted and then re-emitted
    val batch = strings.toVector
    // collapse the per-string futures into one and block until they all complete
    Await.result(Future.sequence(batch.map(postString)), Duration.Inf)
    batch.iterator  // pass the original strings through
  }
}
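
Keep in mind that mapPartitions is lazy, so nothing actually gets posted until an action runs on the returned RDD; for example (assuming a hypothetical strings: RDD[String]):

postStringRDD(strings).count()  // running an action forces the partitions to be evaluated, which performs the posts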
Levi Ramsey