I have multiple files those are independent and need processing by spark. How could I load them into separate rdds in parallel? Thanks!
The coding language is scala.
I have multiple files those are independent and need processing by spark. How could I load them into separate rdds in parallel? Thanks!
The coding language is scala.
If you want concurrent reading/processing of RDDs, you could leverage scala.concurrent.Future (or effects in ZIO, Cats etc).
Sample code for loading function is below:
def load(paths: Seq[String], spark: SparkSession)
(implicit ec: ExecutionContext): Seq[Future[RDD[String]]] = {
def loadSinglePath(path: String): Future[RDD[String]] = Future {
spark.sparkContext.textFile(path)
}
paths map loadSinglePath
}
Sample code for using this function:
import scala.concurrent.duration.{Duration, DurationInt}
val sc = SparkSession.builder.master("local[*]").getOrCreate()
implicit val ec = ExecutionContext.global
val result = load(Seq("t1.txt", "t2.txt", "t3.txt"), sc).zipWithIndex
.map { case (rddFuture, idx) =>
rddFuture.map( rdd =>
println(s"Rdd with index $idx has ${rdd.count()}")
)
}
Await.result(Future.sequence(result), 1 hour)
For example purposes, default global ExecutionContext is provided, but it could be configurable to run your code inside the custom one (just replace this implicit val ec with your own ExecutionContext)