Is using multi-threading in Spark a good idea, especially combined with cache()? For example, I have three or four huge input files and want to produce six output datasets after some filters and joins. Multi-threading alone already gave me a really good performance boost, and adding cache() is faster still, but combining the two doesn't behave as I expected: the last process (the second join) seems to simply ignore the cache. Pseudocode:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration.Inf

val results: Future[Seq[Unit]] = firstInput flatMap { first =>
  secondInput flatMap { second =>
    thirdInput flatMap { third =>
      Future {
        // first join, cached so the six downstream jobs can reuse it
        first.join(second, "id").cache()
      } flatMap { fNs =>

        val df1 = process1(fNs)
        val df2 = process2(fNs)
        val df3 = process3(fNs)
        val df4 = process4(fNs)
        val df5 = process5(fNs)

        // second join -- the one that seems to ignore the cache
        val df6 = fNs.join(third, "id")

        // run all six actions concurrently
        val dfs = Seq(df1, df2, df3, df4, df5, df6)
          .map(df => Future(df.show(100, truncate = false)))

        Future.sequence(dfs)
      }
    }
  }
}

Await.result(results, Inf)
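
One thing I'm considering (a sketch with my own names, assuming fNs is a plain DataFrame): forcing the cache to materialize with an eager action before fanning out, since cache() is lazy and the six concurrent actions may otherwise race to populate it, each recomputing the join:

// cache() only marks the plan; the first action actually fills the cache,
// so run one blocking action up front before launching the parallel jobs
val fNs = first.join(second, "id").cache()
fNs.count() // materializes the cached join exactly once

// all six downstream actions should now read from the cached data
val dfs = Seq(process1(fNs), process2(fNs), process3(fNs),
              process4(fNs), process5(fNs), fNs.join(third, "id"))
  .map(df => Future(df.show(100, truncate = false)))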

If my whole approach is wrong, please let me know.

I just want to know whether it's a good idea to use multi-threading with Spark, and whether it can be combined with caching.
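
For reference, this is roughly how I set up the driver-side concurrency (a sketch; the pool size of 6 is just a guess matching the number of outputs, and FAIR mode is optional):

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import org.apache.spark.sql.SparkSession

// FAIR scheduling lets concurrent jobs share executor slots instead of
// running strictly FIFO (the default), which helps when several actions
// are submitted from different threads
val spark = SparkSession.builder()
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// a dedicated fixed pool bounds how many actions run at once
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(6))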

  • The point of using Spark is not having to deal with multi-threading yourself (among other things), isn't it? – Gaël J Mar 22 '23 at 19:52
  • https://stackoverflow.com/questions/43481253/how-to-perform-multi-threading-or-parallel-processing-in-spark-implemented-in-sc – Dmytro Mitin Mar 23 '23 at 07:45
