
I'm trying to load data from S3 into an Aurora MySQL instance. I did it using PySpark, and the throughput was about 4 GB per hour.

# Write the DataFrame to Aurora MySQL over JDBC
current_df.write.format('jdbc').options(
    url=url,
    driver=jdbc_driver,
    dbtable=table_name,
    user=username,
    password=password).mode("overwrite").save()

I added a few performance improvements and the throughput improved to about 7 GB per hour, but it's still not great.

Parameters Added to the JDBC URL

useServerPrepStmts=false&rewriteBatchedStatements=true
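
For reference, the write with those URL parameters plus Spark's other JDBC write knobs (batchsize, numPartitions, truncate) would look roughly like the sketch below; the endpoint, partition count, and batch size are placeholders that still need tuning for the instance:

url = (
    "jdbc:mysql://<aurora-endpoint>:3306/<database>"
    "?useServerPrepStmts=false&rewriteBatchedStatements=true"
)

# Make the number of concurrent JDBC writers explicit, then batch many rows
# per INSERT round trip instead of one row per statement.
(current_df
    .repartition(32)
    .write
    .format("jdbc")
    .option("url", url)
    .option("driver", jdbc_driver)
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    .option("batchsize", "10000")     # rows per batched INSERT (Spark default is 1000)
    .option("numPartitions", "32")    # cap on parallel write connections
    .option("truncate", "true")       # with overwrite: TRUNCATE instead of DROP/CREATE
    .mode("overwrite")
    .save())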

I also tried another approach:

LOAD DATA FROM S3 's3://${s3.bucket}/${filename}' INTO TABLE ${TableName} FIELDS TERMINATED BY ',';

This way it loads about 5 GB per hour into MySQL.

I have close to 2 TB of data that needs to be loaded into the MySQL instance. Is there any way to load the data faster?

data_addict

1 Answer


Some random thoughts; I hope some will help:

  • Increase the allotted IOPS on both sides. (If you can get some metrics, you might be able to decide which side needs more.)
  • What is s3://...? It sounds like reaching into a remote file system. How remote?
  • If it is quite remote, consider zipping the files, copying them to some closer spot, and then unzipping them (since LOAD does not have an unzip option).
  • Remote... Perhaps you are network bandwidth limited?
  • Run the LOADs in parallel (see the sketch after this list). How many? That is hard to say; probably the IOPS on one end or the other will be the limiting factor.
  • Do you have an 'excessive' number of indexes on the tables? This could slow down the LOAD. Removing indexes could speed up the LOAD, but you would need to re-add the indexes later. The main point is to get rid of indexes you will not need.
  • Having the source files already sorted in PRIMARY KEY order will speed up the LOAD (assuming the target schema already has the PK defined, which it should). Of course, pre-sorting the files takes time, so there is a tradeoff.
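
A minimal sketch of the parallel-LOAD idea, assuming the 2 TB is already split into many S3 objects, that pymysql is available on the client, and that the cluster has the S3 IAM role attached. The bucket, object keys, table name, and connection details are placeholders:

from concurrent.futures import ThreadPoolExecutor

import pymysql

FILES = ["part-0000.csv", "part-0001.csv"]   # placeholder object keys
TABLE_NAME = "<TableName>"                   # placeholder

def load_one(key):
    # One connection and one LOAD DATA FROM S3 statement per source file.
    conn = pymysql.connect(host="<aurora-endpoint>", user="<user>",
                           password="<password>", database="<db>")
    try:
        with conn.cursor() as cur:
            cur.execute(
                f"LOAD DATA FROM S3 's3://<bucket>/{key}' "
                f"INTO TABLE {TABLE_NAME} FIELDS TERMINATED BY ','"
            )
        conn.commit()
    finally:
        conn.close()

# Keep the worker count modest; past some point the IOPS on one end or the
# other, not the client, becomes the bottleneck.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(load_one, FILES))

This pairs well with dropping the unneeded secondary indexes first and re-adding them after the loads finish.
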
Rick James