Is using multi-threading in Spark a good idea, especially with cache? For example, I have three or four huge files and I want to produce 6 output datasets after some filters and joins. I saw a really good performance boost from multi-threading without caching, but using cache is still much faster. I tried to combine cache and multithreading, but it doesn't seem to work the way I expected: the real problem is with the last process, I mean the second join, which seems to just ignore the cache. Pseudocode:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val results: Future[Seq[Unit]] = firstInput flatMap { first =>
  secondInput flatMap { second =>
    thirdInput flatMap { third =>
      Future {
        first.join(second, "id").cache()
      } flatMap { fNs =>
        // six outputs derived from the cached join
        val out1 = process1(fNs)
        val out2 = process2(fNs)
        val out3 = process3(fNs)
        val out4 = process4(fNs)
        val out5 = process5(fNs)
        val out6 = fNs.join(third, "id") // the second join, which seems to ignore the cache

        // run all six actions concurrently
        val dfs = Seq(out1, out2, out3, out4, out5, out6)
          .map(df => Future(df.show(100, truncate = false)))
        Future.sequence(dfs)
      }
    }
  }
}

Await.result(results, Duration.Inf)
If what I'm doing is completely wrong, please let me know.
I just want to know whether it is a good idea to use Spark with multithreading, and whether it is possible to combine it with cache.
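To make the second part of the question concrete: would something like the sketch below be the right direction for combining cache with multithreading? It materializes the cached join once (the count() call) before the parallel actions run. The runAll name and the setup are just illustrative, and process1 … process5 stand for the same filter functions as in the pseudocode above.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.DataFrame

// Illustrative sketch only; process1..process5 are the filter functions from above.
def runAll(first: DataFrame, second: DataFrame, third: DataFrame): Unit = {
  val fNs = first.join(second, "id").cache()
  fNs.count() // force the cache to be populated once, before any parallel action touches it

  val outputs = Seq(
    process1(fNs),
    process2(fNs),
    process3(fNs),
    process4(fNs),
    process5(fNs),
    fNs.join(third, "id") // the second join, now reading from the already-cached fNs
  )

  // launch all six actions concurrently and wait for them to finish
  val futures = outputs.map(df => Future(df.show(100, truncate = false)))
  Await.result(Future.sequence(futures), Duration.Inf)
}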