I am loading data from several databases, namely Redshift and Snowflake, into a PySpark DataFrame using the JDBC driver option.
However, any action after that (e.g. count, group-by, or join) takes a long time for Redshift but is very quick for Snowflake. I am unable to find the cause of the issue: with or without numPartitions, it still takes a long time to perform any action.
Here is the code, which is the same for both except for the jars being used:
df = spark.read \
    .format("jdbc") \
    .option("url", jdbcurl) \
    .option("driver", "com.amazon.redshift.jdbc42.Driver") \
    .option("schema", SchemaName) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", "(tablename) tmp") \
    .option("partitionColumn", partition_key) \
    .option("lowerBound", 1) \
    .option("upperBound", 10000000) \
    .option("numPartitions", 50) \
    .load()
df.count()  # took 88 secs just for the count in Redshift; in Snowflake it took 10-15 secs
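For context on what the partition options do: as I understand it, Spark turns the partitionColumn/lowerBound/upperBound/numPartitions settings into one range predicate per partition, so each executor issues its own query against the source. A rough sketch of that range splitting (my own approximation, not Spark's actual code; the first and last partitions are unbounded so no rows outside the bounds are dropped):

```python
def jdbc_partition_bounds(lower, upper, num_partitions):
    """Approximate how Spark splits [lower, upper) on the partition
    column into per-partition WHERE-clause ranges."""
    stride = (upper - lower) // num_partitions
    bounds = []
    current = lower
    for i in range(num_partitions):
        lo = None if i == 0 else current          # first range: col < hi only
        current += stride
        hi = None if i == num_partitions - 1 else current  # last range: col >= lo only
        bounds.append((lo, hi))
    return bounds

# With the settings from my read above:
parts = jdbc_partition_bounds(1, 10000000, 50)
```

If partition_key's values are badly skewed relative to these even ranges (e.g. most rows fall into one stride), a single task ends up pulling most of the table, which could explain the slow count on one source but not the other.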
The operations after loading are slow, which impacts the performance of the whole Spark job. What could the issue be: the jars, or something else?