
We are planning to migrate from CDH3 to CDH4. As part of this migration we are also planning to bring HBase into our system because it supports updates to the data. In CDH3 we use Hive as our warehouse.

This is where we hit a major problem in the migration: Hive supports table partitions. Our system has many tables in different schemas, and some of them are partitioned by date. We have five years of history, so some tables have 365 * 5 partitions.

We want the same behavior in HBase, but when I searched I couldn't find a way to create partitions in HBase. Can anyone help me implement partition-wise table creation in HBase?

The reason we are moving to HBase is that it supports updates.

If HBase does not support this, which other store (such as MongoDB or Cassandra) supports this behavior?

It would be a great help if we could find at least a workaround.

GHK

2 Answers


HBase has a notion close to a partition, called a region. However, regions don't work like Hive (or RDBMS) partitions. Each region holds a range of keys, and you can break a key range into smaller regions by splitting it: e.g. if your original region holds keys 0-9, you can divide it into two smaller regions, 0-4 and 5-9, or into ten regions, 0, 1, 2, etc.
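
To make the 0-9 example concrete, here is a minimal Python sketch of how a sorted list of split points maps a row key to a region. It only models the idea of range partitioning; `region_for` and the variable names are made up for illustration and are not part of any HBase API.

```python
from bisect import bisect_right

def region_for(row_key, split_points):
    """Return the index of the region whose key range contains row_key.

    split_points holds the sorted start keys of every region except the
    first one, so region i covers [split_points[i-1], split_points[i]).
    """
    return bisect_right(split_points, row_key)

# One region holding keys 0-9, split into two regions: 0-4 and 5-9.
two_regions = [b"5"]

# The same key space split into ten regions, one per leading digit.
ten_regions = [str(d).encode("ascii") for d in range(1, 10)]

print(region_for(b"3", two_regions))  # first region (0-4)
print(region_for(b"7", two_regions))  # second region (5-9)
print(region_for(b"7", ten_regions))  # the region that starts at key 7
```

Keys are compared as raw bytes, which is why the digits are encoded as byte strings rather than integers.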

If you make your key composite, with the date as the first part followed by whatever your key is today, you can pre-split the HBase table so that each day gets one or more regions.
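
As a rough sketch of that idea, the Python below builds a composite key (date prefix followed by the existing id) and one split point per day. The `yyyymmdd` layout, the 12-digit zero padding, and the function names are assumptions chosen for illustration; nothing here calls HBase itself.

```python
from datetime import date, timedelta

def make_row_key(day, tid):
    """Composite key: yyyymmdd date prefix, then the zero-padded record id.

    Zero padding keeps byte order and numeric order of the ids in agreement.
    """
    return day.strftime("%Y%m%d").encode("ascii") + f"{tid:012d}".encode("ascii")

def daily_split_points(first_day, last_day):
    """One split point per day after the first, so each day gets its own region."""
    days = (last_day - first_day).days
    return [
        (first_day + timedelta(days=n)).strftime("%Y%m%d").encode("ascii")
        for n in range(1, days + 1)
    ]

splits = daily_split_points(date(2013, 9, 24), date(2013, 9, 26))
print(splits)
print(make_row_key(date(2013, 9, 25), 42))
```

The resulting list is the sort of split-key set you would hand to HBase when creating a pre-split table. Note that five years of daily splits means roughly 365 * 5 regions for a single table, so a coarser granularity may be worth considering.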

You should note, however, that a key whose most significant bytes are sequential will slow down your writes, because all new rows land in the same region (this may not be a problem if you're doing one-time loads). This problem is known as "hot spotting"; you can read about it, and about a sample approach to overcoming it, in a blog post by Alex Baranau from Sematext.
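
One common way to spread sequential keys is to prefix each key with a small bucket id ("salting"). The sketch below is a generic illustration of that technique, not necessarily the exact approach from the blog post; the bucket count of 16 and the function names are assumptions.

```python
import zlib

NUM_BUCKETS = 16  # hypothetical bucket count, tune to the cluster size

def salted_key(row_key, buckets=NUM_BUCKETS):
    """Prefix the key with a two-digit bucket id derived from the key itself.

    The bucket is deterministic, so a reader who knows the original key
    can recompute the salted key for a point lookup.
    """
    bucket = zlib.crc32(row_key) % buckets
    return f"{bucket:02d}".encode("ascii") + row_key

def scan_prefixes(day_prefix, buckets=NUM_BUCKETS):
    """Reading one day now takes one scan per bucket: the cost of salting."""
    return [f"{b:02d}".encode("ascii") + day_prefix for b in range(buckets)]

key = b"20130925000000000042"
print(salted_key(key))
print(len(scan_prefixes(b"20130925")))
```

The trade-off is visible in `scan_prefixes`: writes for a single day are spread over all buckets, but a range read for that day has to be fanned out across every bucket and merged.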

Arnon Rotem-Gal-Oz
  • Let's say I have a schema called 'demoschema' with a table called 'transtable' that is going to have daily partitions, where each partition has more than 10 million records. Can you please briefly explain how this would fit into HBase? – GHK Sep 26 '13 at 06:45
  • How many years do you want to store in that table? what's your current key? – Arnon Rotem-Gal-Oz Sep 26 '13 at 08:26
  • Currently in Oracle, TID is the primary key; when we bring that data into Hive we put it into date-wise partitions. We currently have five years of history in Hive as daily partitions, so we have around 365 * 5 partitions in Hive, and every day a new partition is created and 10 million new records are put into it. – GHK Sep 26 '13 at 08:52
  • OK, and how many updates do you have? Are they only for new data or also for older data? (If only for new data, you can keep a few days' worth of data in HBase and then export it to Hive once it stabilizes.) Also, what sort of retention are you looking for? – Arnon Rotem-Gal-Oz Sep 27 '13 at 10:48
  • We are going to maintain both historical and new data. There is a possibility that we may sometimes need to update data from, say, 2013-01-15 as well, so we need to keep this data in HBase at all times. – GHK Sep 28 '13 at 01:27
  • In that case you'd probably need a new table per year or something like that. – Arnon Rotem-Gal-Oz Sep 28 '13 at 03:57

I'm afraid you can't partition data in HBase the way you do in Hive. The two tools are quite different from each other in both design and behavior. Data in HBase is, in a sense, already partitioned for you, since HBase partitions the key space and each partition is what we call a region. If you still need more fine-grained partitioning, you could achieve that by using column families wisely.

For example, you could have a column family for each year, which would give you a table with 5 column families.


Edit:

If you need something like what you mentioned in your last comment, you can create a pre-split table. You can choose the start and end row keys of the regions as you see fit: for example, one region per day, where the first and last entries of that day are the start and end boundaries of that region.
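
A minimal sketch of those per-day boundaries, assuming row keys that start with a `yyyymmdd` date prefix (an assumption, as are the function names). The start key is inclusive and the stop key is exclusive, which matches how a key range is usually expressed for both region boundaries and range scans.

```python
from datetime import date, timedelta

def day_boundaries(day):
    """Return (start_row, stop_row) covering every key of the given day.

    The stop row is simply the next day's prefix, so it is exclusive.
    """
    start = day.strftime("%Y%m%d").encode("ascii")
    stop = (day + timedelta(days=1)).strftime("%Y%m%d").encode("ascii")
    return start, stop

def in_range(row_key, start, stop):
    """True if row_key falls inside the half-open range [start, stop)."""
    return start <= row_key < stop

start, stop = day_boundaries(date(2013, 1, 15))
print(start, stop)
print(in_range(b"20130115000000000042", start, stop))
print(in_range(b"20130116000000000001", start, stop))
```

The same pair of keys can serve as the range for reading one day's rows, which is what makes the date-prefixed key behave much like a Hive date partition for reads.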

Tariq
  • Good to see your reply, but we have 10+ million records per day and we want day-wise partitions, so this would create far too many column families, which is not a good design. We are looking for a day-wise partition for each schema. Is there any way? – GHK Sep 25 '13 at 10:14
  • What would be the catch if you push each of these records as a row, where rowkey would be prefixed with a particular day? – Tariq Sep 25 '13 at 11:43
  • Good one. So one row key will hold 10+ million records; is it possible to retrieve a single record by its primary key (the actual primary key in Oracle is ID) and update it? How fast can we retrieve that record in this case? – GHK Sep 26 '13 at 01:51
  • Or you can use a Unix timestamp as the row key and tell the days apart by converting the timestamp to a date. But that can slow down your process. – piyush pankaj Sep 26 '13 at 04:48
  • This may not be an efficient approach. – GHK Sep 26 '13 at 06:40
  • I was suggesting to push every record as a new row, where rowkey could be the timestamp+running-sequence. So, for each day you would have 10+M rows. Now, if you want to fetch all the records for a particular day you can set a filter on the timestamp part of the row key for that particular day. Or, to make it faster you can do a range query as well. – Tariq Sep 26 '13 at 07:01
  • Or use that particular date as a string+a running sequence. – Tariq Sep 26 '13 at 07:12
  • This is the design I had in mind, but I felt that with so many records keyed by the combination of date and PK ID, retrieval would be slower than with Hive partitions. That is why I am looking for another solution. – GHK Sep 26 '13 at 09:00
  • Range query should not be very slow, IMHO. – Tariq Sep 26 '13 at 09:12
  • How can I achieve the same behavior for different database schemas (schema1, schema2, ...)? I didn't find a way to create a database schema in HBase. Is it also a matter of prepending the schema to date + ID in the row key? – GHK Sep 26 '13 at 09:17
  • Could you please elaborate it some more? – Tariq Sep 26 '13 at 09:19
  • What I mean is that the same table can exist in different database schemas. In Hive we can create schemas with create database 'schema1', then use schema1, and then create tables with partitions. So in HBase we can do the same thing with schema + date partition + PKID as the row key. Is this the right way to achieve it, or is there another way? – GHK Sep 26 '13 at 09:28
  • oh..absolutely..you can do that – Tariq Sep 26 '13 at 09:52
  • But we can't create a schema in HBase with something like create database 'schema1', right? What would be the reason they didn't support it? – GHK Sep 26 '13 at 10:28
  • Yes, we can't. Hive was developed to serve as a data warehouse, in which you can have multiple databases residing inside your warehouse. HBase, on the other hand, is a database. The two serve different needs. Please see the edited answer. – Tariq Sep 26 '13 at 18:21
  • OK, I got it. What is your opinion on choosing Cassandra or MongoDB for our use case, compared to HBase, if you have any experience with those two? – GHK Sep 27 '13 at 02:08
  • It largely depends on the kind of data you have and the processing you are gonna perform on that data. The best way to choose one would be to test it for yourself. – Tariq Sep 27 '13 at 08:42
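
The row key layout discussed in the comments above (schema + date + primary key) can be sketched as follows. The `|` separator, the 12-digit padding, and the function names are assumptions made for illustration; the separator is there so that a prefix for `schema1` does not also match `schema10`.

```python
from datetime import date

SEP = b"|"  # hypothetical separator; must not occur in schema names

def schema_day_prefix(schema, day):
    """Prefix shared by every row of one schema on one day."""
    return schema.encode("ascii") + SEP + day.strftime("%Y%m%d").encode("ascii")

def schema_row_key(schema, day, pk_id):
    """Row key: schema name, separator, yyyymmdd date, zero-padded primary key."""
    return schema_day_prefix(schema, day) + f"{pk_id:012d}".encode("ascii")

key = schema_row_key("schema1", date(2013, 9, 26), 7)
print(key)
print(key.startswith(schema_day_prefix("schema1", date(2013, 9, 26))))
```

With this layout, all rows of one schema are contiguous in the key space and, within a schema, all rows of one day are contiguous, so both "schema" and "day" can be read with a prefix or range scan.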