I am reading 5M+ rows from SQL Server and writing them to Parquet with Spark on Dataproc, and the job takes about an hour.
I increased the number of Dataproc workers to 10 and raised fetchsize and batchsize to 500k, but performance is still very slow.
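For context, the job is essentially the following sketch; the connection details, table name, and output path are placeholders, not my real values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver-to-parquet").getOrCreate()

# Read the table over JDBC; fetchsize is the setting I increased to 500k
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")   # placeholder
    .option("dbtable", "dbo.my_table")                                 # placeholder table
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("fetchsize", "500000")
    .load()
)

# Write the result out as Parquet
df.write.mode("overwrite").parquet("gs://<bucket>/output/my_table")    # placeholder path
```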
Is there any way to significantly speed up reading from SQL Server and writing the output as Parquet with PySpark?