Is using multi-threading in Spark a good idea, especially with cache? For example, I have three or four huge files and I want to produce 6 output datasets after some filters and joins. I saw a really good performance boost from multi-threading without caching, but using cache is still much faster. I tried to combine cache and multithreading, but it doesn't seem to work the way I expected: the real problem is with the last process, I mean the second join, which seems to just ignore the cache. Pseudocode:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val results: Future[Seq[Unit]] = firstInput flatMap { first =>
  secondInput flatMap { second =>
    thirdInput flatMap { third =>
      Future {
        first.join(second, "id").cache()
      } flatMap { fNs =>
        // six outputs derived from the cached join
        val out1 = process1(fNs)
        val out2 = process2(fNs)
        val out3 = process3(fNs)
        val out4 = process4(fNs)
        val out5 = process5(fNs)
        val out6 = fNs.join(third, "id") // the second join, which seems to ignore the cache

        // run all six actions concurrently
        val dfs = Seq(out1, out2, out3, out4, out5, out6)
          .map(df => Future(df.show(100, truncate = false)))
        Future.sequence(dfs)
      }
    }
  }
}

Await.result(results, Duration.Inf)
If what I'm doing is completely wrong, please let me know.
I just want to know whether it is a good idea to use Spark with multithreading, and whether it is possible to combine it with cache.
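To make the second part of the question concrete: would something like the sketch below be the right direction for combining cache with multithreading? It materializes the cached join once (the count() call) before the parallel actions run. The runAll name and the setup are just illustrative, and process1 … process5 stand for the same filter functions as in the pseudocode above.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.DataFrame

// Illustrative sketch only; process1..process5 are the filter functions from above.
def runAll(first: DataFrame, second: DataFrame, third: DataFrame): Unit = {
  val fNs = first.join(second, "id").cache()
  fNs.count() // force the cache to be populated once, before any parallel action touches it

  val outputs = Seq(
    process1(fNs),
    process2(fNs),
    process3(fNs),
    process4(fNs),
    process5(fNs),
    fNs.join(third, "id") // the second join, now reading from the already-cached fNs
  )

  // launch all six actions concurrently and wait for them to finish
  val futures = outputs.map(df => Future(df.show(100, truncate = false)))
  Await.result(Future.sequence(futures), Duration.Inf)
}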