
I am trying to migrate an entire table from my RDS instance (MySQL 5.7) to either S3 (as a CSV file) or Hive.

The table holds about 2 TB of data, and it has a BLOB column that stores a zip file (usually around 100 KB, but it can reach 5 MB).

I ran some tests with Spark, Sqoop, and AWS DMS, but had problems with all of them. I have no experience exporting data from RDS with those tools, so I'd really appreciate any help.

Which one is the most recommended for this task, and what strategy do you think is most efficient?

1 Answer


You can copy the RDS data to S3 using AWS Data Pipeline, which has a ready-made template for doing exactly this kind of RDS-to-S3 copy.
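As a rough sketch of what such a pipeline definition looks like (the object types follow the AWS Data Pipeline definition format; the instance id, credentials, table name, and S3 path below are all placeholders you'd replace with your own):

```json
{
  "objects": [
    {
      "id": "Ec2Instance",
      "type": "Ec2Resource",
      "instanceType": "m5.xlarge",
      "terminateAfter": "6 Hours"
    },
    {
      "id": "RdsDb",
      "type": "RdsDatabase",
      "rdsInstanceId": "my-rds-instance",
      "username": "dbuser",
      "*password": "dbpassword"
    },
    {
      "id": "SourceTable",
      "type": "SqlDataNode",
      "database": { "ref": "RdsDb" },
      "table": "my_table",
      "selectQuery": "SELECT * FROM my_table"
    },
    {
      "id": "S3Output",
      "type": "S3DataNode",
      "directoryPath": "s3://my-bucket/rds-export/"
    },
    {
      "id": "CopyTableToS3",
      "type": "CopyActivity",
      "input": { "ref": "SourceTable" },
      "output": { "ref": "S3Output" },
      "runsOn": { "ref": "Ec2Instance" }
    }
  ]
}
```

For a 2 TB table you'd also want to size the EC2 resource and `terminateAfter` window accordingly; check the exact field names against the Data Pipeline template before running it.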

Once you have the dump in S3 in CSV format, it is easy to read the data with Spark and register it as a Hive table:

val df = spark.read.csv("s3://...")
df.write.saveAsTable("mytable") // saves as a Hive table
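One caveat for the BLOB column mentioned in the question: CSV cannot carry raw binary safely, so a common workaround is to encode the column during export (for example with MySQL's `HEX()` or `TO_BASE64()` in the select query) and decode it downstream. A minimal sketch of the decode step, assuming the export used `TO_BASE64()` (the function name and sample bytes here are illustrative, not from the original post):

```python
import base64

def decode_blob(encoded: str) -> bytes:
    """Decode a BLOB column that was exported with MySQL's TO_BASE64().

    TO_BASE64() inserts a newline every 76 characters; Python's
    b64decode discards non-alphabet characters by default, so the
    strip below is just belt-and-braces.
    """
    return base64.b64decode(encoded.replace("\n", ""))

# Round-trip check with placeholder bytes standing in for a zip file.
original = b"PK\x03\x04 fake zip payload"
exported = base64.b64encode(original).decode("ascii")
assert decode_blob(exported) == original
```

With hex encoding instead, the decode side would be `bytes.fromhex(value)`. Either way, encoding roughly doubles (hex) or adds ~33% (base64) to the size of the BLOB data in transit, which matters at 2 TB.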
Avishek Bhattacharya