I am new to Spark and am trying to understand the performance difference between the two approaches below (Spark on Hadoop).
Scenario: as part of batch processing, I have 50 Hive queries to run. Some can run in parallel and some must run sequentially.
- First approach
All of the queries can be stored in a Hive table, and I can write a Spark driver that reads all the queries at once and runs them in parallel (with HiveContext) using Java multithreading; a rough sketch follows this list.
- Pros: easy to maintain
- Cons: all resources may get occupied, and performance tuning can be tough for each query.
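
Here is a minimal sketch of what I have in mind for the first approach. The table name `batch_queries` and its `query_text` column are placeholders I made up, and I am assuming a single HiveContext can be shared across threads (the Spark docs do say the scheduler itself is fully thread-safe):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.hive.HiveContext;

public class ParallelQueryDriver {
    public static void main(String[] args) throws InterruptedException {
        JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("ParallelHiveQueries"));
        // Assumption: one HiveContext shared by all worker threads.
        HiveContext hc = new HiveContext(sc.sc());

        // Hypothetical Hive table holding one query string per row.
        List<Row> rows = hc.sql("SELECT query_text FROM batch_queries").collectAsList();

        ExecutorService pool = Executors.newFixedThreadPool(10); // pool size to be tuned
        for (Row row : rows) {
            String query = row.getString(0);
            // Each action submitted from a different thread becomes a
            // separate Spark job, so the jobs can run concurrently.
            pool.submit(() -> hc.sql(query).count());
        }
        pool.shutdown();
        pool.awaitTermination(2, TimeUnit.HOURS);
        sc.stop();
    }
}
```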
- Second approach
Using Oozie Spark actions, run each query individually.
- Pros: optimization can be done at the query level
- Cons: tough to maintain.
I couldn't find any documentation on the first approach, i.e. how Spark will process the queries internally in that case. From a performance point of view, which approach is better?
The only thing on Spark multithreading I could find is this line from the job scheduling documentation: "within each Spark application, multiple "jobs" (Spark actions) may be running concurrently if they were submitted by different threads".
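
The same job scheduling page also describes a FAIR scheduler mode, which I assume could soften the "all resources may get occupied" concern in the first approach. Roughly (the pool name "batchPool" is made up; pools with weights can be defined in a fairscheduler.xml):

```java
SparkConf conf = new SparkConf()
        .setAppName("ParallelHiveQueries")
        .set("spark.scheduler.mode", "FAIR"); // share executors between concurrent jobs
JavaSparkContext sc = new JavaSparkContext(conf);

// Set per thread, before submitting a query: jobs from that thread
// are then scheduled within the named pool.
sc.setLocalProperty("spark.scheduler.pool", "batchPool");
```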
Thanks in advance