
I need to write a PySpark DataFrame to an Azure SQL database. The DataFrame has 300,000,000 records and the JDBC connector is not able to write it in a reasonable time.

The DataFrame is a select from a Delta table joined with SQL lookup tables. What I've done so far (a rough sketch follows the list):

  1. Partitioned the Delta table by the column used in the WHERE clause (date).
  2. Z-ORDERed the Delta table by the column used in the JOIN.
  3. Created clustered indexes on the lookup (LKP) SQL tables on the columns used in the JOIN.
  4. Cached the final DataFrame.
  5. Wrote the DataFrame using repartition(1000).
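Roughly, what I have now looks like this (a minimal sketch; the paths, table names, credentials, date filter and join column are placeholders for my real ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
connection_props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# Steps 1 and 2 are done once on the Delta table itself, e.g.:
#   OPTIMIZE delta.`/mnt/delta/source_table` ZORDER BY (join_key)

# Read the date-partitioned Delta table, filtering on the partition column.
df = (
    spark.read.format("delta")
    .load("/mnt/delta/source_table")
    .where("event_date = '2021-06-01'")
)

# Join with an SQL lookup table read over JDBC.
lkp = spark.read.jdbc(url=jdbc_url, table="dbo.lookup", properties=connection_props)

# Step 4: cache the final DataFrame.
final_df = df.join(lkp, "join_key").cache()

# Step 5: write with 1000 partitions.
(
    final_df.repartition(1000)
    .write
    .jdbc(url=jdbc_url, table="dbo.target_table", mode="append", properties=connection_props)
)
```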

Do you know of other performance techniques that would improve writing the DataFrame to the SQL database?

inspiredd

1 Answer

Partitioning and parallelism are the two key approaches to apply when loading large DataFrames into SQL Server.

In the Spark UI, you can see that the number of partitions dictates the number of tasks that are launched. These tasks are spread across the executors, which can increase the parallelism of reads and writes through the JDBC interface.

You can use two DataFrameReader APIs to specify partitioning:

  • jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, ...) takes the name of a numeric column (columnName), two range endpoints (lowerBound, upperBound) and a target numPartitions, and generates Spark tasks by splitting the specified range evenly into numPartitions tasks. This works well if your database table has an indexed numeric column with fairly evenly distributed values, such as an auto-incrementing primary key; it works less well if the numeric column is heavily skewed, which leads to imbalanced tasks. (Both APIs are sketched after this list.)

  • jdbc(url: String, table: String, predicates: Array[String], ...) accepts an array of WHERE conditions that define custom partitions; this is useful for partitioning on non-numeric columns or for dealing with skew. When defining custom partitions, remember to account for NULL when the partition columns are nullable. Don't manually define partitions using more than two columns, since writing the boundary predicates requires much more complex logic.
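For illustration, the two reader APIs could be used along these lines (a minimal sketch; the server, credentials, table name, column names, bounds, dates and partition counts are placeholders you would adapt to your own data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# Variant 1: split an indexed numeric column into evenly sized ranges.
df_numeric = spark.read.jdbc(
    url=jdbc_url,
    table="dbo.big_table",
    column="id",              # indexed numeric column, e.g. an identity key
    lowerBound=1,
    upperBound=300_000_000,   # rough min/max of the column
    numPartitions=64,         # 64 parallel reads
    properties=props,
)

# Variant 2: hand-written predicates, e.g. for a non-numeric or skewed column.
# Each predicate becomes one partition; cover NULLs explicitly.
predicates = [
    "load_date >= '2021-01-01' AND load_date < '2021-04-01'",
    "load_date >= '2021-04-01' AND load_date < '2021-07-01'",
    "load_date >= '2021-07-01' AND load_date < '2021-10-01'",
    "load_date >= '2021-10-01' OR load_date IS NULL",
]
df_predicates = spark.read.jdbc(
    url=jdbc_url,
    table="dbo.big_table",
    predicates=predicates,
    properties=props,
)
```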

JDBC writes

Spark’s partitions dictate the number of connections used to push data through the JDBC API. You can control the parallelism by calling coalesce(<N>) or repartition(<N>) depending on the existing number of partitions. Call coalesce when reducing the number of partitions, and repartition when increasing the number of partitions.
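As a sketch of a tuned write (the URL, credentials, target table, partition count and batch size are assumptions to adjust for your workload; final_df stands for the DataFrame from the question):

```python
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"

# Cap the number of concurrent JDBC connections by repartitioning
# (or use .coalesce(64) instead if you are reducing the partition count).
(
    final_df.repartition(64)               # 64 parallel JDBC connections
    .write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.target_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("batchsize", 10000)            # rows per JDBC batch insert
    .mode("append")
    .save()
)
```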

Refer to the Python example given here to implement the same.

Utkarsh Pal