
I'm confused about how many connections Spark would make to the database in the scenario below:

Let's say I have a Spark program running on only one worker node with one executor, and the DataFrame has 10 partitions. I want to write this DataFrame to Teradata. Since the level of parallelism is 10 but there is only 1 executor, will 10 connections be made while saving the data, or only 1?

– amit kumar

2 Answers


Since Spark 2.2, the numPartitions parameter specified for a JDBC data source also controls its writing behavior (in addition to its original purpose of setting the level of parallelism during reads). From the Spark docs:

numPartitions
The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.
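That rule can be stated concretely: the number of concurrent JDBC write connections is capped at min(DataFrame partitions, numPartitions). A minimal sketch in plain Python (no Spark required; the helper name is hypothetical, chosen just to illustrate the documented behavior):

```python
def effective_write_partitions(df_partitions, num_partitions=None):
    """Mirror the documented JDBC write behavior: if the DataFrame has
    more partitions than the numPartitions option allows, Spark calls
    coalesce(numPartitions) before writing, which caps the number of
    concurrent JDBC connections."""
    if num_partitions is None:
        return df_partitions  # no cap: up to one connection per partition
    return min(df_partitions, num_partitions)

# The scenario from the question: 10 partitions, one executor.
# Without a numPartitions cap, up to 10 connections can be opened
# (though a single-core executor would use them one at a time).
print(effective_write_partitions(10))     # 10
print(effective_write_partitions(10, 1))  # 1
```

Note that this is an upper bound on concurrency, not a guarantee: a single executor with one core still processes its write tasks sequentially.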

– mazaneicha
  • Thanks @mazaneicha. I thought along similar lines too, but after facing a weird situation lately I got confused about whether my understanding was right. Reading from a Hive table and then writing to Teradata, I saw multiple blocking sessions getting created with no data being written. Then, after reducing the worker nodes to 1, I was able to write a few thousand records without any issues. But with a million records, it started failing due to deadlock. After reducing the partitions to 1 using coalesce(1), the load went fine. – amit kumar Nov 05 '20 at 05:33
  • So the question is: assuming Teradata doesn't support concurrent writes, how was data getting written after reducing the workers to 1, since in that case n connections were still made for n partitions? Secondly, when reducing the partitions to 1, a large dataset (~2 million rows) also got written seamlessly. So if concurrent writes were the problem, the former case should also have failed. – amit kumar Nov 05 '20 at 05:44
  • Not sure; possibly a single node (executor) executes tasks (writes partitions) sequentially? That'd be the case if you have 1 core per executor. But I would start by checking the Teradata configuration for any client connection limits. And the firewall's too, if there is any... – mazaneicha Nov 05 '20 at 14:09

It depends on your Spark -> Teradata solution.

In general you will have 1 connection per core; each core iterates over its own partitions one by one.

For example, if you use .foreach with a custom solution, you will have 1 connection at a time for each row.

If you use foreachPartition with a custom solution, you will have 1 connection shared by many rows.
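The foreach-vs-foreachPartition difference can be sketched by counting connection openings. This is plain Python simulating the two iteration patterns, not actual Spark code; the partitioning and row counts are made up for illustration:

```python
# Pretend we have 100 rows split across 4 partitions.
rows = list(range(100))
partitions = [rows[i::4] for i in range(4)]

# foreach-style custom writer: open a connection per row.
per_row_connections = 0
for part in partitions:
    for _ in part:
        per_row_connections += 1  # one open/close cycle per row

# foreachPartition-style custom writer: one connection per partition,
# reused for every row in that partition.
per_partition_connections = 0
for part in partitions:
    per_partition_connections += 1  # one connection for the whole partition
    for _ in part:
        pass  # write each row over the shared connection

print(per_row_connections)        # 100
print(per_partition_connections)  # 4
```

This is why foreachPartition (or the built-in JDBC writer, which also works per partition) is strongly preferred over per-row connection handling.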

– maxime G
  • Hi @maxime-g, got your point. I understand the number of partitions equals the number of cores; please correct me if I am wrong. But I am not using foreach/foreachPartition and am actually using Spark JDBC as described here: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html – amit kumar Nov 05 '20 at 07:02
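For reference, the coalesce(1) workaround described in the comments would look roughly like this with the built-in JDBC writer. This is a non-runnable fragment: it assumes an existing SparkSession and DataFrame `df`, and the host, database, and table names are placeholders, not real endpoints:

```python
# Illustrative fragment only -- requires a live Spark session, the
# Teradata JDBC driver on the classpath, and real connection details.
(df.coalesce(1)                      # 1 partition -> 1 JDBC connection,
   .write                           # serializing the write
   .format("jdbc")
   .option("url", "jdbc:teradata://<host>/DATABASE=<db>")
   .option("dbtable", "<target_table>")
   .option("driver", "com.teradata.jdbc.TeraDriver")
   .mode("append")
   .save())
```

Alternatively, setting the numPartitions option on the writer achieves the same cap without an explicit coalesce, per the Spark documentation quoted in the accepted answer.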