Background: I am working on a web project that exposes analytical data stored in a local MSSQL database. The database is updated regularly. An EMR cluster runs custom Hive scripts that process raw data from S3 and save the analytical results back to S3. Every time the database needs updating, the EMR cluster is launched, the analytical files are downloaded from S3 to a local drive, and then imported into SQL Server tables. The current data flow is:
S3 -> HDFS/Hive-> S3 -> Local Drive -> DB Tables
We are now moving the DB server to AWS and want to make this process automated and faster. The desired new data flow is:
S3 -> HDFS/Hive -> RDS.
The Hive script is very complex and we have to keep using it. Hive cannot use RDS as a storage location for external tables. I looked at Data Pipeline, but it needs S3 as intermediate storage and requires a lot of setup. I also read about writing a custom map/reduce job that connects to RDS via JDBC and does the import; I am willing to study custom mappers/reducers if that is a good solution.
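For context, the import step such a custom JDBC job would perform is essentially "parse Hive's delimited output and batch-insert it". A minimal sketch of that logic, using Python with sqlite3 as a stand-in for the RDS endpoint (the table name and three-column layout are hypothetical; the real schema comes from our Hive script):

```python
import sqlite3

# Hive writes delimited text files; the default field separator is \x01 (Ctrl-A).
HIVE_DELIM = "\x01"

def import_hive_rows(lines, conn, table):
    """Parse Hive output lines and batch-insert them into the given table.

    Assumes a hypothetical 3-column layout (dt, metric, value) purely for
    illustration; real column lists would be derived from the Hive schema.
    """
    rows = [line.rstrip("\n").split(HIVE_DELIM) for line in lines if line.strip()]
    conn.executemany(
        "INSERT INTO %s (dt, metric, value) VALUES (?, ?, ?)" % table, rows
    )
    conn.commit()
    return len(rows)

# sqlite3 stands in for RDS here; against SQL Server the same pattern applies
# via a JDBC/ODBC connection with parameterized batch inserts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (dt TEXT, metric TEXT, value REAL)")
sample = [
    "2014-01-01\x01page_views\x01123.0\n",
    "2014-01-01\x01clicks\x0142.0\n",
]
inserted = import_hive_rows(sample, conn, "results")
print(inserted)  # 2
```

In a real MapReduce job this parsing would live in the mapper/reducer and the insert would go through the JDBC connection, but the per-record work is the same.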
My question is: what solution is easy to configure (we need to update different tables according to dates), has a small learning curve (we are a C#/SQL shop, new to EMR, and time-constrained), is efficient (no unnecessary data movement), and scales well enough to achieve HDFS/Hive -> RDS?