
I have a question which might need some input from folks.

I have a traditional RDBMS (e.g. SQL Server) storing data which is updated daily. I want to leverage the Hadoop ecosystem, so the first step might be to export data from our RDBMS to a Hadoop data store. I would also like to do incremental updates of that data store, probably nightly. I'm not sure which tool to use.

It seems Hive is not the right option when we consider incremental updates (correct me if this assumption is wrong), so what about HBase, or some other appropriate option?

Lubor
Options like Impala are platform-specific. I would suggest you do a POC and find out which best suits you. Please see my updated answer. – Ram Ghadiyaram Jun 15 '16 at 04:10

2 Answers


The Hadoop ecosystem includes a tool called Sqoop that is designed to solve exactly the problem you describe: pulling data from an RDBMS into Hadoop. It supports several methods of doing incremental updates. It requires a JDBC connection to your database, and for some databases it can use high-performance, vendor-specific options. It's one of the better tools in Hadoop.
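A minimal sketch of a full-table import (host, database, table, and user names here are all hypothetical):

    # import the whole table into HDFS as delimited files
    sqoop import \
      --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
      --username etl_user -P \
      --table customers \
      --target-dir /data/customers \
      --num-mappers 4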

When I say "into Hadoop", this could mean several things, but typically either a) a set of files stored on the Hadoop distributed file system (HDFS), or b) data stored in HBase. And technically, HBase is just another way of storing files on HDFS.

Hive is a layer on top of HDFS that allows you to treat the RDBMS tables you export to HDFS files as though they were still in your SQL Server database. Well, kinda. Hive can query a number of file formats using a SQL-like language.

HDFS has one particular challenge that you need to understand: there's no way to update a row in place, as there is in a regular database. An HDFS file is a "write once, read many" design. Typically, you segment a dataset into multiple files along some natural partition, so that if you do need to update a record, you need only rewrite the files of the affected partition -- year + month is a common partitioning scheme.
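For example, with a hypothetical /data/orders dataset partitioned by year and month, an "update" means rewriting only one partition's files:

    # hypothetical layout: one directory per year/month partition
    #   /data/orders/year=2016/month=05/part-m-00000
    #   /data/orders/year=2016/month=06/part-m-00000

    # remove only the partition whose records changed...
    hdfs dfs -rm -r /data/orders/year=2016/month=06
    # ...then re-import just that slice from the RDBMS into the same path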

So if you're Sqoop'ing a database whose records never change, you can simply append to your HDFS files. This is fine for transactions, logs, and other data like that, which typically never get changed. But records that do get updated (a customer name or email, for example) make for a more difficult problem.
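For the append-only case, Sqoop's incremental mode can pick up just the rows added since the last run (the column name and last-imported value here are hypothetical):

    # append only rows whose transaction_id is greater than the last value seen
    sqoop import \
      --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
      --username etl_user -P \
      --table transactions \
      --target-dir /data/transactions \
      --incremental append \
      --check-column transaction_id \
      --last-value 1000000

If you define this as a saved Sqoop job (sqoop job --create ...), Sqoop tracks the last value for you between runs. There is also an --incremental lastmodified mode that keys off a timestamp column, for rows that change.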

HBase makes this HDFS limitation go away by transparently managing updates to existing records. But HBase is a key-value store; the key might be whatever your RDBMS's primary key is, and the value needs to be the rest of the record. This isn't terrible, but it can be cumbersome.
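Sqoop can load directly into HBase; since HBase writes are effectively upserts, re-importing a changed row simply replaces the old version. A sketch, with table, row key, and column family names made up:

    # load rows as HBase puts keyed on the RDBMS primary key
    sqoop import \
      --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
      --username etl_user -P \
      --table customers \
      --hbase-table customers \
      --column-family cf \
      --hbase-row-key customer_id \
      --hbase-create-table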

I believe the latest versions of Hive (and possibly Impala, which is similar in function to Hive) allow updates while still storing data in the more flexible file formats. In Hive's case this is the ACID support added in Hive 0.14, which works on bucketed, transactional ORC tables.
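A minimal sketch of such a table, assuming a Hive version with ACID transactions enabled (the transaction manager and related settings must be configured; table and column names are hypothetical):

    hive -e "
    CREATE TABLE customers (id INT, name STRING, email STRING)
    CLUSTERED BY (id) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    UPDATE customers SET email = 'new@example.com' WHERE id = 42;
    "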

So Sqoop is the tool you want, but think carefully about what you'll want to do with the data once it's in Hadoop -- it's a very, very different thing from just a database that can get really big.

Tom Harrison

1) Data ingestion: you can consider Sqoop.

2) Incremental updates:

You can use HBase for incremental updates, with Hive external tables (via the HBase storage handler).

Please see https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
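Following the HBaseIntegration wiki above, a Hive external table mapped onto an existing HBase table looks roughly like this (table, column family, and column names are made up):

    hive -e "
    CREATE EXTERNAL TABLE hbase_customers (key INT, name STRING, email STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:email')
    TBLPROPERTIES ('hbase.table.name' = 'customers');
    "

Sqoop (or direct HBase puts) can then apply the nightly upserts to the underlying HBase table, and Hive queries will see the updated rows.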

Also have a look at this answer of mine.

Options like Impala are platform-specific (Cloudera) but can reuse the Hive metastore, and Impala is faster at query processing than Hive.

There are some key features in Impala that make it fast:

  1. It does not use MapReduce, which is very expensive because it forks separate JVMs. It runs a separate Impala daemon, which splits the query, runs the pieces in parallel, and merges the result sets at the end.

  2. It does most of its operations in-memory.

  3. It uses HDFS for its storage, which is fast for large files. It caches as much as possible, from queries to results to data.

  4. It supports newer file formats like Parquet, a columnar file format, so queries that access only a few columns are much faster (see the sketch after this list).
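An illustration of the Parquet point, using impala-shell (table and column names are hypothetical):

    # one-time conversion of an existing table into Parquet
    impala-shell -q "CREATE TABLE orders_parquet STORED AS PARQUET AS SELECT * FROM orders"

    # a columnar scan: Impala reads only the two columns the query touches
    impala-shell -q "SELECT customer_id, SUM(amount) FROM orders_parquet GROUP BY customer_id"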

Ram Ghadiyaram