I need to get data based on a time range. Is there any way to partition an HBase table based on a time range? For example, I want data from, say, 9:00 to 9:05.
3 Answers
You can create a compound key of the form <timestamp><id>, and then all entries in HBase will be ordered by timestamp. You can then create a scanner that starts at the beginning of the range and ends at the end of the range.
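A minimal sketch of that layout, assuming the Java client API, epoch-millisecond timestamps, and a table named "events" (the table name, concrete times, and id encoding are my assumptions, not from the answer):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeRangeScan {

        // Compound row key: 8-byte big-endian timestamp followed by the id.
        // Bytes.toBytes(long) is big-endian, so lexicographic key order
        // matches numeric timestamp order for non-negative timestamps.
        static byte[] rowKey(long timestampMillis, String id) {
            return Bytes.add(Bytes.toBytes(timestampMillis), Bytes.toBytes(id));
        }

        public static void main(String[] args) throws IOException {
            long start = 1357030800000L; // e.g. 9:00 as epoch millis
            long stop  = 1357031100000L; // 9:05; the scan stop is exclusive

            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("events"))) {

                // A bare timestamp sorts before any <timestamp><id> key with
                // the same prefix, so these bounds cover the whole range.
                Scan scan = new Scan();
                scan.setStartRow(Bytes.toBytes(start)); // inclusive
                scan.setStopRow(Bytes.toBytes(stop));   // exclusive

                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toStringBinary(r.getRow()));
                    }
                }
            }
        }
    }

On HBase 2.x clients, withStartRow/withStopRow replace the deprecated setStartRow/setStopRow used above.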
One issue you may face is that with a high insert rate, a single server becomes the hotspot for all new entries. One way around that is to invert the key and ensure that the first part is random: <sha1 of id><timestamp>. This has the advantage of distributing the writes across the entire cluster, but the disadvantage of requiring a read of the entire table to get a particular range.
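A sketch of that inverted layout, assuming string ids and SHA-1 from java.security (the encoding details are my choice; the answer only specifies the key order):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedKey {

        // Row key: 20-byte SHA-1 of the id first, timestamp last. New writes
        // land on effectively random regions, which removes the hotspot, but
        // a time-range query now has to read across the whole key space.
        static byte[] rowKey(String id, long timestampMillis)
                throws NoSuchAlgorithmException {
            byte[] hash = MessageDigest.getInstance("SHA-1")
                    .digest(id.getBytes(StandardCharsets.UTF_8));
            return Bytes.add(hash, Bytes.toBytes(timestampMillis));
        }
    }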
If you use the first method of <timestamp><id>, then your map job may not be able to split the work into as many chunks as you might like. By default, the table is split into one input chunk per region. If your time slice is small enough, a single region could be serving all the data, and your query gains no parallelism. You could potentially write a custom table split that parallelizes the query across more mappers than regions, but you would still be reading all of the data from one region, which has its own drawbacks for parallelism.
How you set up your table depends on your projected usage scenario, the read/write ratio, and how much performance you need from each.
If you append an id to your timestamp to ensure uniqueness, you can still get a scanner to return all events with a given timestamp. HBase sorts keys lexicographically based on their byte representation. So, if your key is <timestamp>:<id>, you can set your scanner to start at row <timestamp> and stop at row <timestamp+1> to get all events at that timestamp.
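As a sketch, reusing the table and key layout assumed above: the 8-byte big-endian encoding of timestamp + 1 sorts immediately after every key that starts with timestamp, so a half-open scan picks up every id appended to that one timestamp (the variable names are mine):

    // All events stamped exactly <timestamp>, whatever id was appended:
    // start row <timestamp> (inclusive), stop row <timestamp + 1> (exclusive).
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(timestamp));
    scan.setStopRow(Bytes.toBytes(timestamp + 1));
    ResultScanner scanner = table.getScanner(scan); // one Result per <timestamp>:<id> key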
You could make the timestamp the first part of your key. The obvious disadvantage is that you can no longer query directly for other keys. If both kinds of lookup are important to you, you could consider duplicating your data.
For me the problem is duplicate entries. I can have many events occurring at the same time. For example, I can have 10 events occurring at, say, 10:05. If I convert that to epoch time and insert it, they can overwrite each other (or fail to write) in HBase.
I can append an ID along with the timestamp, but can I still set a start and end time in a MapReduce job if I add this ID?
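For what it's worth, a sketch of how that could look: TableMapReduceUtil.initTableMapperJob takes a Scan, and because the ID is appended after the timestamp, the scan bounds can stay bare timestamps (the table name "events", the mapper, and the time values are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class TimeRangeJob {

        // Hypothetical mapper: emits a count of 1 per row in the time range.
        static class EventMapper
                extends TableMapper<ImmutableBytesWritable, LongWritable> {
            @Override
            protected void map(ImmutableBytesWritable key, Result row,
                               Context ctx)
                    throws java.io.IOException, InterruptedException {
                ctx.write(key, new LongWritable(1L));
            }
        }

        public static void main(String[] args) throws Exception {
            long start = 1357030800000L; // 9:00 as epoch millis (example)
            long stop  = 1357031100000L; // 9:05

            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "events-9:00-9:05");
            job.setJarByClass(TimeRangeJob.class);

            // The appended ID never interferes with the bounds: every
            // <timestamp><id> key in the range sorts between the two bare
            // timestamps.
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes(start));
            scan.setStopRow(Bytes.toBytes(stop));
            scan.setCaching(500);
            scan.setCacheBlocks(false); // recommended for MapReduce scans

            TableMapReduceUtil.initTableMapperJob(
                    "events", scan, EventMapper.class,
                    ImmutableBytesWritable.class, LongWritable.class, job);
            job.setNumReduceTasks(0);                         // map-only sketch
            job.setOutputFormatClass(NullOutputFormat.class); // discard output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }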
