I have a huge amount of data in a few Oracle tables (around 50 GB in total). I have to join these tables and perform some calculations, and the tables don't have any partitions. I need to read this data into a PySpark data frame and finally write a CSV file to S3. Running the query on the database, fetching the data, and writing directly to S3 is taking a long time, even though the result of the query is only around 100 MB.

Can using `repartition` on this data frame help improve the performance in any way? Or is there another way to make this operation faster?
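For context, here is a minimal sketch of a partitioned JDBC read with the join pushed down to Oracle. The connection URL, credentials, tables, the `ID` partition column, and its bounds are all illustrative assumptions, not details from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-to-s3").getOrCreate()

# Placeholder connection details -- substitute your own.
jdbc_url = "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1"
props = {
    "user": "my_user",
    "password": "my_password",
    "driver": "oracle.jdbc.OracleDriver",
}

# Push the join down to Oracle as a subquery so only the ~100 MB result
# set crosses the network, not the full 50 GB of raw rows.
# (Hypothetical tables and columns; Oracle needs the trailing alias.)
pushdown_query = """
    (SELECT o.id, o.amount, c.name
       FROM orders o
       JOIN customers c ON c.id = o.cust_id) q
"""

# Fetch over several parallel JDBC connections instead of one.
# The partition column must be numeric or date/timestamp; "ID" and its
# bounds here are assumptions -- pick a roughly uniform key of your own.
df = spark.read.jdbc(
    url=jdbc_url,
    table=pushdown_query,
    column="ID",
    lowerBound=1,
    upperBound=1_000_000,
    numPartitions=8,
    properties=props,
)

# The result is small, so coalesce to a single CSV file before writing.
df.coalesce(1).write.csv("s3a://my-bucket/output/", header=True, mode="overwrite")
```

Note that `repartition` only redistributes rows after they have already been fetched over the JDBC connection, so it cannot speed up the database read itself; it is the `column`/`lowerBound`/`upperBound`/`numPartitions` read options that parallelize the fetch.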

- What is taking a long time, fetching the data from the DB or writing it with Spark? – Assaf Segev Mar 30 '22 at 08:55
- Fetching the data from the database is taking a lot of time. – Sidhant Gupta Mar 30 '22 at 09:34
- Do you connect with the JDBC API? – Assaf Segev Mar 30 '22 at 10:33