
I am new to Spark and trying to understand the performance difference between the approaches below (Spark on Hadoop).

Scenario: As part of batch processing, I have 50 Hive queries to run. Some can run in parallel and some sequentially.

- First approach

All of the queries can be stored in a Hive table, and I can write a Spark driver to read all queries at once and run them in parallel (with HiveContext) using Java multi-threading.

  • Pros: easy to maintain.
  • Cons: all resources may get occupied, and performance tuning can be tough for each query.

- Second approach

Using Oozie Spark actions, run each query individually.

  • Pros: optimization can be done at the query level.
  • Cons: tough to maintain.

I couldn't find any documentation about how Spark will process the queries internally in the first approach. From a performance point of view, which approach is better?

The only thing on Spark multithreading I could find is: "within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads"
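To make the first approach concrete, here is a minimal sketch of a driver that submits queries from a fixed thread pool. Note the assumptions: `runQuery` is a hypothetical stand-in for the real call (in an actual driver it would invoke `hiveContext.sql(query)` and trigger an action), and the pool size, query strings, and class name are illustrative, not taken from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelQueries {

    // Hypothetical stand-in for hiveContext.sql(query) followed by an
    // action; the real version would run on a shared SparkContext, which
    // is thread-safe for job submission.
    static String runQuery(String query) {
        return "done: " + query;
    }

    // Submit every query to a bounded thread pool, then wait for all
    // of them to finish and collect the results.
    public static List<String> runAll(List<String> queries, int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String q : queries) {
                futures.add(pool.submit(() -> runQuery(q)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks until that query completes
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAll(Arrays.asList("q1", "q2", "q3"), 3));
    }
}
```

With this pattern, each thread's `sql(...)` call becomes a separate concurrent Spark job inside one application, which is exactly the behavior the quoted sentence from the scheduling docs describes; the jobs then compete for the application's executors under the configured scheduler (FIFO by default, FAIR if enabled).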

Thanks in advance

user2895589

1 Answer


Since your requirement is to run Hive queries in parallel, with the condition

Some can run parallel and some sequential

this kind of workflow is best handled by a DAG processor, which Apache Oozie is. This approach will be cleaner than managing your queries in code, i.e. you would be building your own DAG processor instead of using the one provided by Oozie.
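As a sketch of how that DAG looks in Oozie (the action names, master setting, class name, and jar path below are placeholders, not from the question), a fork/join pair expresses the parallel part and ordinary action chaining expresses the sequential part:

```xml
<workflow-app name="hive-batch" xmlns="uri:oozie:workflow:0.5">
    <start to="fork-parallel"/>

    <!-- q1 and q2 have no dependencies on each other: run them in parallel -->
    <fork name="fork-parallel">
        <path start="run-q1"/>
        <path start="run-q2"/>
    </fork>

    <action name="run-q1">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>run-q1</name>
            <class>com.example.RunHiveQuery</class>
            <jar>${nameNode}/apps/query-runner.jar</jar>
            <arg>q1.hql</arg>
        </spark>
        <ok to="join-parallel"/>
        <error to="fail"/>
    </action>

    <!-- run-q2: same shape as run-q1, also joining at join-parallel -->

    <!-- q3 depends on q1 and q2, so it runs only after the join -->
    <join name="join-parallel" to="run-q3"/>

    <!-- run-q3: same shape as run-q1, with <ok to="end"/> -->

    <kill name="fail">
        <message>Query failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each Spark action also gets its own spark-submit configuration, which is where the per-query tuning mentioned in the question's second approach happens.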

rogue-one
  • let's assume I can run all of the queries in parallel (queries which take avg 7-8 min each to complete in Hive); would multithreading be a good choice? – user2895589 Feb 04 '17 at 13:42
  • yes.. if all queries have to run in parallel without any dependencies on other queries, then parallel execution instead of a DAG processor like Oozie would make sense.. – rogue-one Feb 04 '17 at 13:50