
I have multiple files that are independent and need processing by Spark. How can I load them into separate RDDs in parallel? Thanks!

The coding language is Scala.

Susu
  • What is the file format? You can create an RDD with val rdd = sc.textFile(yourfilename) – Amit Nov 25 '20 at 17:15

1 Answer


If you want concurrent reading/processing of RDDs, you could leverage scala.concurrent.Future (or effect systems such as ZIO or Cats Effect).

Sample code for the loading function is below:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

import scala.concurrent.{ExecutionContext, Future}

// Kick off one Future per path; each Future builds an RDD for its file.
def load(paths: Seq[String], spark: SparkSession)
        (implicit ec: ExecutionContext): Seq[Future[RDD[String]]] = {
  def loadSinglePath(path: String): Future[RDD[String]] = Future {
    spark.sparkContext.textFile(path)
  }

  paths map loadSinglePath
}

Sample code for using this function:

import scala.concurrent.duration.DurationInt
import scala.concurrent.{Await, ExecutionContext, Future}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
implicit val ec: ExecutionContext = ExecutionContext.global

// Each Future prints its RDD's row count once loading completes.
val result = load(Seq("t1.txt", "t2.txt", "t3.txt"), spark).zipWithIndex
  .map { case (rddFuture, idx) =>
    rddFuture.map(rdd =>
      println(s"Rdd with index $idx has ${rdd.count()} rows")
    )
  }

// Block until every Future has finished.
Await.result(Future.sequence(result), 1.hour)

For example purposes, the default global ExecutionContext is used, but the code is configurable to run inside a custom one: just replace the implicit val ec with your own ExecutionContext, as in the sketch below.
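A minimal sketch of such a custom context, assuming a fixed pool of four threads suits your workload (the pool size is a placeholder to tune):

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// Assumption: four concurrent loads is enough; adjust the pool size as needed.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))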

  • If what you want is to open multiple files as a single RDD, then there is a better and simpler approach, which is described here: https://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd?rq=1 – Mykhailo Kravchenko Nov 27 '20 at 04:20
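For reference, a minimal sketch of that single-RDD approach (file names reused from the example above): SparkContext.textFile accepts comma-separated paths as well as glob patterns.

// One RDD spanning all three files; a glob such as "data/*.txt" also works.
val combined = spark.sparkContext.textFile("t1.txt,t2.txt,t3.txt")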