
I am reading data from SQL Server containing 5M rows and upwards, which takes about an hour to read and write to Parquet using Spark on Dataproc.

I increased the number of workers for Dataproc to 10 and increased fetchsize and batchsize to 500k, but performance is still very slow.

Is there any way we can significantly improve reading data from SQL Server and writing the output as Parquet using PySpark?
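For reference, a minimal sketch of the kind of job described above. The connection details, table name, and output path are placeholders, not taken from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver-to-parquet").getOrCreate()

# Hypothetical connection details and table name -- replace with your own.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
    .option("dbtable", "dbo.source_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("fetchsize", "500000")  # rows fetched per JDBC round trip, as tuned in the question
    .load()
)

# Without JDBC partitioning options, this read runs as a single task
# no matter how many workers the cluster has.
df.write.mode("overwrite").parquet("gs://<bucket>/output/")
```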

wmorris
  • Are you partitioning the reads? https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/sql-databases#manage-parallelism Are you monitoring the query performance on SQL Server? https://learn.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-using-the-query-store?view=sql-server-ver16 – David Browne - Microsoft Jul 12 '22 at 16:09
  • Thanks David. I've tried partitioning by adding the numPartition option, but it still reads on a single partition. SQL Server is also very slow in returning all the results for the table. I haven't checked the query performance but will do so, as you mentioned it. – wmorris Jul 13 '22 at 18:05
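As an illustration of the parallelism advice in the comments: setting numPartitions on its own does not split a JDBC read; Spark also needs partitionColumn, lowerBound, and upperBound before it will issue one query per partition. A partitioned read might look like the sketch below, where the column name, bounds, and connection details are assumptions rather than values from the thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver-to-parquet").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
    .option("dbtable", "dbo.source_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    # All four options below are needed for a parallel JDBC read; with only
    # numPartitions set, Spark still reads through a single partition.
    .option("partitionColumn", "id")   # assumed numeric key column on the source table
    .option("lowerBound", "1")
    .option("upperBound", "5000000")   # roughly the row count mentioned in the question
    .option("numPartitions", "10")     # e.g. one slice per worker
    .option("fetchsize", "10000")
    .load()
)

df.write.mode("overwrite").parquet("gs://<bucket>/output/")
```

Each of the 10 partitions then reads its own range of `id`, so the work is spread across the cluster instead of funnelling through one connection.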

0 Answers