
Background: I am working on a web project to expose analytical data stored in a local MSSQL database. The database is updated regularly. An EMR cluster runs custom Hive scripts to process raw data from S3 and saves the analytical results back to S3. Each time the database needs updating, the EMR cluster is launched, the analytical files are downloaded from S3 to a local drive, and they are then imported into SQL Server tables. The current data flow is:

S3 -> HDFS/Hive -> S3 -> Local Drive -> DB Tables

So we are going to move the DB Server to AWS and make this process automated and faster. We want the new data flow to be:

S3 -> HDFS/Hive -> RDS.

The Hive script is very complex and we have to keep using it. Hive cannot use RDS as a storage location for external tables. I looked at Data Pipeline, but it needs S3 as intermediate storage and requires a lot of setup. I also read about writing a custom map/reduce job that connects to RDS via JDBC and does the import; I am willing to study custom mappers/reducers if that is a good solution.
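To be explicit about the custom-job option: below is a minimal sketch of what I understand such a JDBC import could look like, using Hadoop's DBOutputFormat. The connection string, table, columns, and delimiter are made-up placeholders, and the SQL Server JDBC driver jar would have to be on the job's classpath.

```java
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class HiveToRdsExport {

  // Hypothetical record matching a two-column analytics table in RDS.
  public static class MetricRecord implements DBWritable {
    private String metricDate;
    private double value;

    MetricRecord() {}
    MetricRecord(String metricDate, double value) {
      this.metricDate = metricDate;
      this.value = value;
    }

    public void write(PreparedStatement stmt) throws SQLException {
      stmt.setString(1, metricDate);
      stmt.setDouble(2, value);
    }

    public void readFields(ResultSet rs) throws SQLException {
      metricDate = rs.getString(1);
      value = rs.getDouble(2);
    }
  }

  // Map-only job: parse each delimited line of Hive output and emit it as a DB record.
  public static class ExportMapper
      extends Mapper<LongWritable, Text, MetricRecord, NullWritable> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Hive's default text delimiter is ^A (\001); adjust if the table uses another one.
      String[] cols = line.toString().split("\u0001");
      ctx.write(new MetricRecord(cols[0], Double.parseDouble(cols[1])), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder connection details; the SQL Server JDBC driver jar must be on the classpath.
    DBConfiguration.configureDB(conf,
        "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "jdbc:sqlserver://my-rds-endpoint:1433;databaseName=Analytics",
        "dbuser", "dbpassword");

    Job job = Job.getInstance(conf, "hive-to-rds-export");
    job.setJarByClass(HiveToRdsExport.class);
    job.setMapperClass(ExportMapper.class);
    job.setNumReduceTasks(0);                              // map-only import
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(DBOutputFormat.class);
    job.setOutputKeyClass(MetricRecord.class);
    job.setOutputValueClass(NullWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));  // directory holding the Hive result files
    DBOutputFormat.setOutput(job, "DailyMetrics",          // hypothetical target table and its columns
        "MetricDate", "MetricValue");
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

As I understand it, with zero reduce tasks each mapper writes its parsed rows straight to the database, so the number of concurrent RDS connections is bounded by the number of map tasks.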

My question is: what solution for achieving HDFS/Hive -> RDS is easy to configure (we need to update different tables depending on the date), has a small learning curve (we are a C#/SQL shop, new to EMR, and time-constrained), is efficient (no unnecessary data movement), and is able to scale?

Tzu
  • Have you considered looking at Sqoop? http://sqoop.apache.org/ It was designed specifically to handle transfer of data between Hadoop and structured data stores. Your flow may become S3 -> HDFS/Hive -> Sqoop -> RDS – brandon.bell May 29 '15 at 18:41
  • You can automate this process by using Oozie, which can start Sqoop at a specified time to transfer the data. – akhil gupta May 29 '15 at 19:24
  • Thank you @brandon.bell! Sqoop looks very interesting. I will try it out. – Tzu May 29 '15 at 19:36
  • I also read about [Hive Storage Handlers](https://cwiki.apache.org/confluence/display/Hive/StorageHandlers). This seems like the ultimate solution, because you can inject any storage handler into the Hive DDL (use the STORED BY clause and suppress the LOCATION clause). Given my budget, though, I will evaluate Sqoop first. – Tzu May 29 '15 at 19:41
  • @akhilgupta, I am using EMR, and I think I can just schedule a "run Sqoop" step right after the "run Hive script" step (a rough sketch of what that export might look like is below). I will check out Oozie if I have scheduling needs later. Thank you for the information. – Tzu May 29 '15 at 19:46
  • I would strongly suggest doing this in Qubole instead. Most of our customers do this (S3 -> RDS/Redshift and RDS -> S3). We have import/export operators built on top of Sqoop and take care of all the nitty-gritty details. Especially if you don't have much time to figure this out, it can't get any simpler. – Joydeep Sen Sarma Jun 25 '15 at 14:22
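Update: for reference, the Sqoop export step I am planning looks roughly like the command below. It is only a sketch: the endpoint, database, credentials, table, export directory, and delimiter are placeholders for our real values, and the SQL Server JDBC driver jar has to be available to Sqoop (e.g. dropped into its lib directory).

```sh
# Push one day's Hive output into the matching RDS table (all names are placeholders).
sqoop export \
  --connect "jdbc:sqlserver://my-rds-endpoint:1433;databaseName=Analytics" \
  --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
  --username dbuser --password dbpassword \
  --table DailyMetrics \
  --export-dir /user/hive/warehouse/daily_metrics/dt=2015-05-29 \
  --input-fields-terminated-by '\001' \
  --num-mappers 4
```

On EMR this can be added as a step right after the Hive step, and the table name and partition directory can be parameterized by date, which should cover our "update different tables according to dates" requirement.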

0 Answers