
I'm trying to load data from S3 into an Aurora MySQL instance. I did it using PySpark, and the throughput was about 4 GB per hour.

# Write the DataFrame to Aurora MySQL over JDBC
current_df.write.format('jdbc').options(
    url=url,
    driver=jdbc_driver,
    dbtable=table_name,
    user=username,
    password=password).mode("overwrite").save()

I added a few performance improvements and the throughput improved to about 7 GB per hour, but it's still not great.

Parameters Added to the JDBC URL

useServerPrepStmts=false&rewriteBatchedStatements=true
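
For reference, the write with those URL parameters plus Spark's other JDBC write knobs (batchsize, numPartitions, truncate) would look roughly like the sketch below; the endpoint, partition count, and batch size are placeholders that still need tuning for the instance:

url = (
    "jdbc:mysql://<aurora-endpoint>:3306/<database>"
    "?useServerPrepStmts=false&rewriteBatchedStatements=true"
)

# Make the number of concurrent JDBC writers explicit, then batch many rows
# per INSERT round trip instead of one row per statement.
(current_df
    .repartition(32)
    .write
    .format("jdbc")
    .option("url", url)
    .option("driver", jdbc_driver)
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    .option("batchsize", "10000")     # rows per batched INSERT (Spark default is 1000)
    .option("numPartitions", "32")    # cap on parallel write connections
    .option("truncate", "true")       # with overwrite: TRUNCATE instead of DROP/CREATE
    .mode("overwrite")
    .save())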

I also tried another approach:

LOAD DATA FROM S3 's3://${s3.bucket}/${filename}' INTO TABLE ${TableName} FIELDS TERMINATED BY ',';

This way it loads about 5 GB per hour into MySQL.

I have close to 2 TB of data that needs to be loaded into the MySQL instance. Is there any way to load the data faster?

data_addict

1 Answer


Some random thoughts; I hope some will help:

  • Increase the allotted IOPS on both sides. (If you can get some metrics, you might be able to decide which side needs more.)
  • What is s3://...? It sounds like reaching into a remote file system. How remote?
  • If it is quite remote, consider zipping the files, copying them to some closer spot, and then unzipping them (since LOAD does not have an unzip option).
  • Remote... Perhaps you are network bandwidth limited?
  • Run the LOADs in parallel (see the sketch after this list). How many? That is hard to say; probably the IOPS on one end or the other will be the limiting factor.
  • Do you have an 'excessive' number of indexes on the tables? This could slow down the LOAD. Removing indexes could speed up the LOAD, but you would need to re-add the indexes later. The main point is to get rid of indexes you will not need.
  • Having the source files already sorted in PRIMARY KEY order will speed up the LOAD (assuming the target schema already has the PK defined, which it should). Of course, pre-sorting the files takes time, so there is a tradeoff.
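
A minimal sketch of the parallel-LOAD idea, assuming the 2 TB is already split into many S3 objects, that pymysql is available on the client, and that the cluster has the S3 IAM role attached. The bucket, object keys, table name, and connection details are placeholders:

from concurrent.futures import ThreadPoolExecutor

import pymysql

FILES = ["part-0000.csv", "part-0001.csv"]   # placeholder object keys
TABLE_NAME = "<TableName>"                   # placeholder

def load_one(key):
    # One connection and one LOAD DATA FROM S3 statement per source file.
    conn = pymysql.connect(host="<aurora-endpoint>", user="<user>",
                           password="<password>", database="<db>")
    try:
        with conn.cursor() as cur:
            cur.execute(
                f"LOAD DATA FROM S3 's3://<bucket>/{key}' "
                f"INTO TABLE {TABLE_NAME} FIELDS TERMINATED BY ','"
            )
        conn.commit()
    finally:
        conn.close()

# Keep the worker count modest; past some point the IOPS on one end or the
# other, not the client, becomes the bottleneck.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(load_one, FILES))

This pairs well with dropping the unneeded secondary indexes first and re-adding them after the loads finish.
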
Rick James