
I need to execute a set of different Hive queries inside a for loop.

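from pyspark.sql import HiveContext

hc = HiveContext(sc)  # sc is an existing SparkContext
queryList = [set of queries]  # placeholder for the actual query strings
for i in range(0, X):
    sparkDF = hc.sql(queryList[i])  # assumed: the query result is what gets saved
    sparkDF.write.saveAsTable('hiveTable', mode='append')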

Though this code works like a charm for small values of X, it causes issues when X > 100. The delay between successive saveAsTable jobs grows exponentially, even though each job itself takes a more or less constant 5 s.
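To narrow down where the growing delay is spent, the two driver-side calls of each iteration can be timed separately. The following is only a minimal sketch: `time_iterations`, `run_query` and `save_result` are hypothetical names that do not appear in the original code. The two callables stand in for `hc.sql(...)` and `sparkDF.write.saveAsTable(...)`.

```python
import time


def time_iterations(queries, run_query, save_result):
    """Time the query call and the save call of every iteration separately.

    run_query and save_result are hypothetical callables standing in for
    hc.sql(...) and DataFrame.write.saveAsTable(...).
    Returns a list of (index, query_seconds, save_seconds) tuples.
    """
    timings = []
    for i, query in enumerate(queries):
        t0 = time.perf_counter()
        result = run_query(query)
        t1 = time.perf_counter()
        save_result(result)
        t2 = time.perf_counter()
        timings.append((i, t1 - t0, t2 - t1))
    return timings
```

With the code from the question it could be called as `time_iterations(queryList, hc.sql, lambda df: df.write.saveAsTable('hiveTable', mode='append'))`. If one of the two columns keeps growing while the Spark UI still reports about 5 s per job, the extra time is probably spent on the driver inside that call.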

The things I tried to rectify this, without any luck:

  1. Adding a gc.collect() call inside the for loop every 100 iterations (i%100==0). But this breaks the for loop.
  2. Closing the current Spark and Hive contexts every 100 iterations (i%100==0) and creating new ones. This still doesn't solve the problem.
  3. Using yarn-cluster instead of yarn-client. No luck!
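For reference, the first attempt can be sketched in plain Python as below. `run_with_periodic_gc` and `step` are hypothetical names, with `step(i)` standing in for one query-and-save iteration. On its own, `gc.collect()` returns normally and does not change the loop's control flow, so if the loop stops at that point the cause is probably elsewhere.

```python
import gc


def run_with_periodic_gc(n_iterations, step, every=100):
    """Call step(i) for each i and force a garbage collection every `every` iterations.

    step is a hypothetical callable standing in for one query-and-save iteration.
    Returns how many times gc.collect() was invoked.
    """
    collections = 0
    for i in range(n_iterations):
        step(i)
        if i > 0 and i % every == 0:
            gc.collect()  # returns the number of unreachable objects found
            collections += 1
    return collections
```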

Is there an option to, say, create a connection to Hive and close it every time I call the saveAsTable function? Or a way to clean up the driver?

Mike

1 Answer


This is happening because you are using a for loop, which is executed on the Spark driver and is not distributed across the cluster's worker nodes. That means the loop itself does not use the power of parallelism. Try to create an RDD using parallelize with partitions, which will help spawn the jobs on the worker nodes.

Or, if you just want to handle the HiveContext, you can create a global HiveContext, like val hiveCtx = new HiveContext(sc) (Scala syntax), and reuse it inside the loop.

You can also change/optimize the number of executors when running the job on the cluster to improve performance.
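One caveat on parallelizing the loop: SparkContext and HiveContext can only be used on the driver, so `hc.sql` cannot be called from inside an RDD operation on the worker nodes. A different technique that does overlap the queries is to submit them from several driver threads. The sketch below assumes a hypothetical `run_and_save` callable that wraps the `hc.sql(...)` and `saveAsTable(...)` calls for a single query. Whether this helps with a delay that grows per iteration, and whether concurrent appends to the same Hive table are safe in your setup, would need to be checked.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(queries, run_and_save, max_workers=4):
    """Submit each query from its own driver thread.

    run_and_save is a hypothetical callable that runs one query and saves
    its result. Results are returned in the same order as the input queries.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_and_save, queries))
```

Something like `run_concurrently(queryList, lambda q: hc.sql(q).write.saveAsTable('hiveTable', mode='append'))` would be one way to call it with the code from the question.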

Nitin
  • The job is distributed for sure, because I could see the saveAsTable job getting executed on all the executors that I requested. Also, since the table I'm querying is cached in memory, I could see all the data partitions as process-local. By the way, I've edited the code to clear the air around the reuse of the HiveContext. A single query-and-save actually takes only about 5 seconds on average. The issue is the delay between successive jobs. – Mike Apr 16 '17 at 23:05
  • In the current scenario each save job may be distributed, but the iteration over the queries is not. The list of queries is still processed sequentially, which is why you feel the issue is the delay between successive jobs. Try to parallelize the loop over the queries. – Nitin Apr 17 '17 at 00:02
  • I'm fine with the jobs being executed sequentially. The issue is that the delay between jobs grows with every iteration. – Mike Apr 17 '17 at 05:49