I need to execute a set of different Hive queries inside a for loop:
from pyspark.sql import HiveContext

hc = HiveContext(sc)   # sc is the existing SparkContext
queryList = [...]      # set of queries

for i in range(X):
    # run the query and append the result to the Hive table
    sparkDF = hc.sql(queryList[i])
    sparkDF.write.saveAsTable('hiveTable', mode='append')
Although this code works like a charm for small values of X, it causes issues when X > 100: the delay between consecutive saveAsTable jobs grows exponentially, while each job itself takes a roughly constant ~5 s.
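To make the behaviour concrete, this is roughly how I timed it (the time.time() bookkeeping is just illustrative instrumentation, not part of the actual job; hc, queryList, and X are as above):

import time

for i in range(X):
    t0 = time.time()
    sparkDF = hc.sql(queryList[i])                         # time spent in hc.sql()
    t1 = time.time()
    sparkDF.write.saveAsTable('hiveTable', mode='append')  # the ~5 s write job itself
    t2 = time.time()
    print('iter %d: sql %.1fs, saveAsTable %.1fs' % (i, t1 - t0, t2 - t1))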
Things I have tried to rectify this, without any luck:
- Adding a gc.collect() inside the for loop whenever i % 100 == 0 (see the sketch after this list), but this breaks the for loop.
- Closing the current Spark and Hive contexts whenever i % 100 == 0 and creating new ones (also sketched below), but this still does not solve the problem.
- Using yarn-cluster instead of yarn-client: no luck!
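For reference, the first two attempts looked roughly like this (a sketch; I tried them separately, but they are combined into one block here, and I am assuming a no-argument SparkContext() picks up the same SparkConf as the original context):

import gc
from pyspark import SparkContext
from pyspark.sql import HiveContext

for i in range(X):
    sparkDF = hc.sql(queryList[i])
    sparkDF.write.saveAsTable('hiveTable', mode='append')
    if i > 0 and i % 100 == 0:
        gc.collect()           # attempt 1: force a driver-side garbage collection
        sc.stop()              # attempt 2: tear down and rebuild both contexts
        sc = SparkContext()
        hc = HiveContext(sc)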
Is there an option to open a connection to Hive and close it each time I call the saveAsTable function? Or a way to clean up the driver?