Partitioning and parallelism are the two key techniques to apply when loading large DataFrames into SQL Server.
In the Spark UI, you can see that the number of partitions dictates the number of tasks that are launched. These tasks are distributed across the executors, which increases the parallelism of reads and writes through the JDBC interface.
You can use two DataFrameReader APIs to specify partitioning:
jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, ...)
takes the name of a numeric column (columnName), two range endpoints (lowerBound, upperBound), and a target numPartitions, and generates Spark tasks by splitting the specified range evenly into numPartitions slices. This works well if your database table has an indexed numeric column with fairly evenly distributed values, such as an auto-incrementing primary key; it works less well if the numeric column is heavily skewed, since the resulting tasks become imbalanced.
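Below is a minimal PySpark sketch of a range-partitioned read using this API. The server, database, credentials, table name (dbo.Orders), partition column (OrderID), and bounds are placeholders, not values from the original text.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-partitioned-read").getOrCreate()

# Placeholder connection details for a SQL Server instance.
jdbc_url = "jdbc:sqlserver://<server>:1433;databaseName=<database>"
connection_properties = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# Spark splits the range [1, 10_000_000] on OrderID into 16 roughly equal
# slices and issues one query (one task) per slice.
df = spark.read.jdbc(
    url=jdbc_url,
    table="dbo.Orders",      # hypothetical table name
    column="OrderID",        # indexed numeric column
    lowerBound=1,
    upperBound=10_000_000,
    numPartitions=16,
    properties=connection_properties,
)
```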
jdbc(url: String, table: String, predicates: Array[String], ...)
accepts an array of WHERE conditions that define custom partitions. This is useful for partitioning on non-numeric columns or for dealing with skew. When defining custom partitions, remember to account for NULL if the partition columns are nullable. Avoid manually defining partitions on more than two columns, since writing the boundary predicates requires much more complex logic.
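The following sketch shows predicate-based partitioning with a hypothetical Region column; it reuses the spark session, jdbc_url, and connection_properties from the previous example. The final catch-all predicate covers NULL and unmatched values so no rows are silently dropped.

```python
# Each WHERE clause becomes one Spark partition (and one JDBC query).
predicates = [
    "Region = 'NorthAmerica'",
    "Region = 'Europe'",
    "Region = 'Asia'",
    # Catch-all partition for NULL and any other values.
    "Region IS NULL OR Region NOT IN ('NorthAmerica', 'Europe', 'Asia')",
]

df = spark.read.jdbc(
    url=jdbc_url,
    table="dbo.Orders",      # hypothetical table name
    predicates=predicates,
    properties=connection_properties,
)
```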
JDBC writes
Spark’s partitions dictate the number of connections used to push data through the JDBC API. You can control this parallelism by calling coalesce(<N>) or repartition(<N>), depending on the existing number of partitions: call coalesce when reducing the number of partitions and repartition when increasing it.
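Below is a minimal sketch of a parallel JDBC write, reusing jdbc_url and connection_properties from the read examples; the partition counts and target table name are illustrative only.

```python
# Each output partition opens its own JDBC connection to SQL Server.
df_out = df.repartition(8)    # increase partitions (triggers a full shuffle)
# df_out = df.coalesce(4)     # or reduce partitions without a full shuffle

df_out.write.jdbc(
    url=jdbc_url,
    table="dbo.Orders_Staging",   # hypothetical target table
    mode="append",
    properties=connection_properties,
)
```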
Refer to the Python example given here to implement these patterns.