
I am reading data from SQL Server containing 5M rows and upwards, which takes about an hour to read and write to Parquet using Spark on Dataproc.

I increased the number of workers for Dataproc to 10 and increased fetchsize and batchsize to 500k, but performance is still very slow.

Is there any way we can significantly improve reading data from SQL Server and writing the output as Parquet using PySpark?
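For reference, a minimal sketch of the kind of job described above. The connection details, table name, and output path are placeholders, not taken from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver-to-parquet").getOrCreate()

# Hypothetical connection details and table name -- replace with your own.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
    .option("dbtable", "dbo.source_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("fetchsize", "500000")  # rows fetched per JDBC round trip, as tuned in the question
    .load()
)

# Without JDBC partitioning options, this read runs as a single task
# no matter how many workers the cluster has.
df.write.mode("overwrite").parquet("gs://<bucket>/output/")
```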

wmorris
  • Are you partitioning the reads? https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/sql-databases#manage-parallelism Are you monitoring the query performance on SQL Server? https://learn.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-using-the-query-store?view=sql-server-ver16 – David Browne - Microsoft Jul 12 '22 at 16:09
  • Thanks David. I've tried partitioning by adding the numPartition option, but it still reads on a single partition. SQL Server is also very slow in returning all the results for the table. I haven't checked the query performance but will do so, as you mentioned it. – wmorris Jul 13 '22 at 18:05
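As an illustration of the parallelism advice in the comments: setting numPartitions on its own does not split a JDBC read; Spark also needs partitionColumn, lowerBound, and upperBound before it will issue one query per partition. A partitioned read might look like the sketch below, where the column name, bounds, and connection details are assumptions rather than values from the thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver-to-parquet").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
    .option("dbtable", "dbo.source_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    # All four options below are needed for a parallel JDBC read; with only
    # numPartitions set, Spark still reads through a single partition.
    .option("partitionColumn", "id")   # assumed numeric key column on the source table
    .option("lowerBound", "1")
    .option("upperBound", "5000000")   # roughly the row count mentioned in the question
    .option("numPartitions", "10")     # e.g. one slice per worker
    .option("fetchsize", "10000")
    .load()
)

df.write.mode("overwrite").parquet("gs://<bucket>/output/")
```

Each of the 10 partitions then reads its own range of `id`, so the work is spread across the cluster instead of funnelling through one connection.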

0 Answers