I need to use Spark to export data from Hive (partitioned) to Teradata (non-partitioned).
Cluster spec: 120 worker nodes, each with 16 cores and 128 GB RAM. The table is around 130 GB, and the DataFrame I create from it has 130,000 partitions.
val df = spark.sql("select * from <hive-external-table>")
df.write.mode(SaveMode.Append).jdbc(<jdbc_url>, <teradata_table>, <connectionproperties>)
I am launching Spark in client mode with the following configuration:
--executor-cores 5 --executor-memory 16G --num-executors 200
The job was running smoothly, but it opened more than 300 JDBC connections on the Teradata side; the DBAs panicked and killed them, which caused my job to fail.
Now, my questions are:
Should I use coalesce to reduce the number of partitions from 130,000? If yes, what is the optimum number and how do I choose it?
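For context, this is roughly what I have in mind (a sketch only; the ~128 MB-per-partition target is a common rule of thumb, not a Spark requirement, and the table name is a placeholder):

```scala
// Hypothetical sizing: 130 GB table, aiming for ~128 MB per output partition.
val tableSizeMB       = 130L * 1024                      // total data, in MB
val targetPartitionMB = 128L                             // heuristic target size
val targetPartitions  = (tableSizeMB / targetPartitionMB).toInt  // ~1040

// coalesce only merges existing partitions, so it avoids a full shuffle
// (unlike repartition, which redistributes all rows).
val df = spark.sql("select * from <hive-external-table>")
           .coalesce(targetPartitions)
```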
As I said, the job was running pretty fast, using around 900 vcores of the 1500 I requested; the Spark UI showed Stage 0 at (40230 + 932) / 130,000 tasks.
How can I configure the job so that it creates no more than, say, 100 JDBC connections?
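One thing I found while researching this: the JDBC data source has a `numPartitions` write option that caps concurrent connections (Spark coalesces the DataFrame down to that many partitions before writing; this applies to writes from Spark 2.2 onward). A sketch, with the host, database, credentials, and table name all as placeholders:

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.put("user", "<td_user>")                       // placeholder credentials
props.put("password", "<td_password>")
props.put("driver", "com.teradata.jdbc.TeraDriver")  // Teradata JDBC driver class
props.put("numPartitions", "100")                    // caps concurrent JDBC connections
props.put("batchsize", "10000")                      // rows per batched INSERT (default 1000)

df.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:teradata://<td_host>/database=<td_db>", "<teradata_table>", props)
```

Would this be the right lever, or is an explicit coalesce before the write preferable?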
Would it help if the target table on the Teradata side were partitioned on the same column? How?
The job got that many vcores from YARN because they were idle, but when other production jobs were submitted, they ran very slowly because they could not get the executors they needed.
The most important question: shouldn't YARN deallocate the executors it initially gave my job and reallocate them to the other jobs? I was running my job in the default queue, which has no special priority.
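From what I have read, with a fixed `--num-executors` the allocation is static, and YARN will only take executors back through queue preemption; dynamic allocation would let Spark release idle executors on its own. Is something like the following the right direction (the min/max/timeout values below are guesses on my part, not recommendations I found anywhere)?

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=10 \
  --conf spark.dynamicAllocation.maxExecutors=100 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  ...
```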
Should I drop Spark and use Sqoop instead?
Please let me know if you need any other information about the system, and excuse me if my problem isn't clearly stated; I am not that experienced with Spark.