I need to use Spark to export data from Hive (partitioned) to Teradata (non-partitioned).
Cluster spec: 120 worker nodes, each with 16 cores and 128 GB RAM. The table is around 130 GB, and the DataFrame I create from it has 130,000 partitions.
val df = spark.sql("select * from <hive-external-table>")
df.write.mode(SaveMode.Append).jdbc(<jdbc_url>, <teradata_table>, <connectionproperties>)
I am launching Spark in client mode with the following configuration:
--executor-cores 5 --executor-memory 16G --num-executors 200
The job was running smoothly, but it opened more than 300 JDBC connections on the Teradata side; the DBAs panicked and killed them, which caused my job to fail.
Now, my questions are:
Should I use coalesce to reduce the number of partitions from 130,000? If yes, what is the optimum number and how do I choose it?
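For context, this is roughly what I have in mind (a sketch only; the ~128 MB-per-partition target is a common rule of thumb, not a Spark requirement, and the table name is a placeholder):

```scala
// Hypothetical sizing: 130 GB table, aiming for ~128 MB per output partition.
val tableSizeMB       = 130L * 1024                      // total data, in MB
val targetPartitionMB = 128L                             // heuristic target size
val targetPartitions  = (tableSizeMB / targetPartitionMB).toInt  // ~1040

// coalesce only merges existing partitions, so it avoids a full shuffle
// (unlike repartition, which redistributes all rows).
val df = spark.sql("select * from <hive-external-table>")
           .coalesce(targetPartitions)
```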
As I said, the job was running pretty fast, using around 900 vcores of the 1500 I requested; the Spark UI showed Stage 0 at (40230 + 932) / 130,000 tasks.
How can I configure the job so that it creates no more than, say, 100 JDBC connections?
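One thing I found while researching this: the JDBC data source has a `numPartitions` write option that caps concurrent connections (Spark coalesces the DataFrame down to that many partitions before writing; this applies to writes from Spark 2.2 onward). A sketch, with the host, database, credentials, and table name all as placeholders:

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.put("user", "<td_user>")                       // placeholder credentials
props.put("password", "<td_password>")
props.put("driver", "com.teradata.jdbc.TeraDriver")  // Teradata JDBC driver class
props.put("numPartitions", "100")                    // caps concurrent JDBC connections
props.put("batchsize", "10000")                      // rows per batched INSERT (default 1000)

df.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:teradata://<td_host>/database=<td_db>", "<teradata_table>", props)
```

Would this be the right lever, or is an explicit coalesce before the write preferable?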
Would it help if the target table on the Teradata side were partitioned on the same column? How?
The job got that many vcores from YARN because they were idle, but when other production jobs were submitted, they ran very slowly because they could not get the executors they needed.
The most important question: shouldn't YARN deallocate the executors it initially gave my job and reallocate them to the other jobs? I was running my job in the default queue, which has no special priority.
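From what I have read, with a fixed `--num-executors` the allocation is static, and YARN will only take executors back through queue preemption; dynamic allocation would let Spark release idle executors on its own. Is something like the following the right direction (the min/max/timeout values below are guesses on my part, not recommendations I found anywhere)?

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=10 \
  --conf spark.dynamicAllocation.maxExecutors=100 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  ...
```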
Should I drop Spark and use Sqoop instead?
Please let me know if you need any other information about the system, and excuse me if my problem isn't clearly stated; I am not that experienced with Spark.