I have a fairly large on-premises table (~1.5 billion rows) that I am trying to pull into AWS S3 in Parquet format using AWS Glue. I am using Spark JDBC to read the table and write it to S3. The problem is that I cannot pull all the data from the source table in one go, as the source DB runs out of memory and complains. To work around that, I am using the predicates option to push down filters in parallel, which works fine for pulling the data in chunks of 200 million rows or so. But when I try to write this dataframe to S3, it takes almost half an hour to complete:
df = spark.read.jdbc(url=host_url,
                     table="TABLENAME",
                     predicates=predicates,
                     properties={
                         "user": username,
                         "password": password
                     })
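For reference, predicates is a list of non-overlapping range filters, built roughly like the sketch below (the id column name and the bounds are only illustrative; the real key column and limits are specific to my table):

chunk_size = 200_000_000
max_id = 1_500_000_000
predicates = [
    f"id >= {lo} AND id < {lo + chunk_size}"
    for lo in range(0, max_id, chunk_size)
]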
So what I want to do is run the read-from-DB stage in sequence:
Read Part 1 from DB --> Read Part 2 from DB --> Read Part 3 from DB
And then write all the data to S3 in parallel
Write Part 1 || Write Part 2 || Write Part 3
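Roughly what I'm imagining for the sequential read stage, as a sketch (I'm not at all sure that persisting and then counting each part like this is the right way to force the reads to run one at a time):

from pyspark import StorageLevel

# Each "part" here is a slice of the predicates list -- the slicing below
# is just illustrative.
part_predicates = [predicates[0:3], predicates[3:6], predicates[6:]]

parts = []
for preds in part_predicates:
    part = spark.read.jdbc(url=host_url,
                           table="TABLENAME",
                           predicates=preds,
                           properties={"user": username,
                                       "password": password})
    part = part.persist(StorageLevel.MEMORY_AND_DISK)
    part.count()   # an action, so (as I understand it) this is when the JDBC queries actually fire
    parts.append(part)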
Two problems I have with this:
- I have no idea when Spark actually fires those queries against the DB. I know it is not when I define the dataframe as shown above, so I cannot figure out how to serialize stage 1 (would forcing an action like count() after each read, as in the sketch above, be enough?).
- I have looked around and could not find an option to write multiple dataframes to Parquet partitions in parallel. Should I just use Python to parallelize the dataframe-to-Parquet write calls (see the sketch below)? Is it even advisable to do so?
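For the second point, what I had in mind is something like this sketch, where each write is submitted from its own driver thread (the S3 paths are placeholders, and I don't know whether firing concurrent Spark jobs from threads like this is safe or advisable):

from concurrent.futures import ThreadPoolExecutor

# Each already-materialized part is written from its own driver thread.
# The S3 path (s3://my-bucket/tablename/part_N/) is just a placeholder.
def write_part(indexed_part):
    i, part = indexed_part
    part.write.mode("overwrite").parquet(f"s3://my-bucket/tablename/part_{i}/")

with ThreadPoolExecutor(max_workers=len(parts)) as pool:
    list(pool.map(write_part, enumerate(parts)))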