
My requirement is to

  1. Move data from Oracle to HDFS
  2. Process the data on HDFS
  3. Move processed data to Teradata.

It is also required to do this entire processing every 15 minutes. The volume of source data may be close to 50 GB, and the processed data may be about the same size.

After searching a lot on the internet, I found that:

  1. OraOop to move data from Oracle to HDFS (have the code within a shell script and schedule it to run at the required interval; see the sketch after this list).
  2. Do the large-scale processing with either custom MapReduce, Hive, or Pig.
  3. Sqoop with the Teradata connector to move data from HDFS to Teradata (again, have a shell script with the code and then schedule it).
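
A minimal sketch of that wrapper-script idea, assuming Sqoop, Hive, and the Teradata connector are available on an edge node; every host, credential, table name, and path below is a placeholder I made up, not something from the actual setup:

#!/bin/bash
# pipeline.sh -- one 15-minute cycle: Oracle -> HDFS -> process -> Teradata.
# All hosts, credentials, table names, and paths are placeholders.
set -e

RUN_DIR=/staging/$(date +%Y%m%d%H%M)

# 1. Pull the source table from Oracle into HDFS
sqoop import --connect jdbc:oracle:thin:@orahost:1521/db \
  --username xxx --password yyy --table SRC_TBL --target-dir ${RUN_DIR}/raw

# 2. Process on the cluster (a Hive query is shown; custom MapReduce or Pig would also work;
#    the Hive table would need to be defined over the imported files or loaded via --hive-import)
hive -e "insert overwrite directory '${RUN_DIR}/out' select ... from my_table where ..."

# 3. Push the processed files to Teradata
sqoop export --connect jdbc:teradata://tdhost/DATABASE=MY_BASE \
  --username xxx --password yyy --table TGT_TBL --export-dir ${RUN_DIR}/out

A cron entry could then trigger the script every 15 minutes:

*/15 * * * * /opt/etl/pipeline.sh >> /var/log/etl_pipeline.log 2>&1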

Is this the right approach in the first place, and is it feasible within the required 15-minute window (please note that this is not a daily batch or anything like that)?

Other options that I found are the following:

  1. Storm (for real-time data processing), but I am not able to find an Oracle spout or a Teradata bolt out of the box.
  2. Open-source ETL tools such as Talend or Pentaho.

Please share your thoughts on these options as well and any other possibilities.

  • Sorry, that was a typo. It's actually 50 GB. – Manikandan Kannan Jun 03 '13 at 04:48
  • Storm could work for this, but as you have already figured out there are no prebuilt spouts and bolts for what you need, so you'd have to create them yourself. That said, if you find you can't get your solution to move fast enough, you might want to consider using Storm for processing the data, since it has lower overhead once it is up and running than Hadoop. – Gordon Seidoh Worley Jun 03 '13 at 22:00

2 Answers


Looks like you have several questions so let's try to break it down.

Importing into HDFS

It seems you are looking for Sqoop. Sqoop is a tool that lets you easily transfer data into and out of HDFS, and it can connect natively to various databases, including Oracle. Sqoop is compatible with the Oracle JDBC thin driver. Here is how you would transfer from Oracle to HDFS:

sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --target-dir /path/to/dir

For more information: here and here. Note that you can also import directly into a Hive table with Sqoop, which could be convenient for your analysis (a sketch follows).
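
For example, a direct-to-Hive import might look like this (a sketch only, using Sqoop's --hive-import and --hive-table options; the connection details and table names are placeholders):

sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --hive-import --hive-table mytable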

Processing

As you noted, since your data is initially relational, it is a good idea to use Hive for your analysis, as you may be more familiar with its SQL-like syntax. Pig is closer to pure relational algebra and its syntax is not SQL-like; it is largely a matter of preference, but both approaches should work fine (a small Pig Latin sketch follows the Hive example below).

Since you can import data into Hive directly with Sqoop, your data should be ready to process as soon as it is imported.

In Hive you could run your query and tell it to write the results to HDFS:

hive -e "insert overwrite directory '/path/to/output' select * from mytable ..."
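
For contrast, a roughly equivalent step expressed in Pig Latin might look like the following (a sketch only, with a made-up schema, paths, and filter, just to illustrate the non-SQL syntax):

pig -e "raw = LOAD '/path/to/input' USING PigStorage(',') AS (id:int, amount:double); big = FILTER raw BY amount > 100.0; STORE big INTO '/path/to/output' USING PigStorage(',');"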

Exporting into Teradata

Last year Cloudera released a Sqoop connector for Teradata, as described here, so you should take a look, as this seems to be exactly what you want. Here is how you would do it:

sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output
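
One caveat (an assumption on my part about your output format): Hive's insert overwrite directory writes its files with Ctrl-A (\001) field delimiters by default, so the export may need to be told which delimiter to expect, for example:

sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output --input-fields-terminated-by '\001'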

The whole thing is definitely doable in whatever time period you want; in the end, what will matter is the size of your cluster. If you want it quick, scale your cluster up as needed. The good thing with Hive and Sqoop is that the processing will be distributed across your cluster, so you have full control over the schedule.

Charles Menguy
  • Thanks Charles. One more question on the processing: what if the processing has complex transformations and flow? Does it make sense to fit Talend or Pentaho in here? – Manikandan Kannan Jun 03 '13 at 10:27

If you have concerns about the overhead or latency of moving the data from Oracle into HDFS, a possible commercial solution might be Dell Software’s SharePlex. They recently released a connector for Hadoop that would allow you to replicate table data from Oracle to Hadoop. More info here.

I’m not sure whether you need to reprocess the entire data set each time or can just use the deltas. SharePlex also supports replicating the change data to a JMS queue. It might be possible to create a spout that reads from that queue. You could probably also build your own trigger-based solution, but it would be a bit of work.

As a disclosure, I work for Dell Software.