Multiple timestamp as column-names: Timeseries data design for quick data retrieval using BigData

Question

 ID             Timestamp1 Timestamp2 Timestamp3 Timestamp4 Timestamp5

 101003978854       10.1     34.2        23.5        19.36      28.05
 101003998120       21.19    15.09       13.24       21.86      10.34
 109721347573       13.76    26.8        10.09       31.12      27.43

The structure above is what I would like to store in HBase. I know that HBase queries that filter on a single column name or use SingleColumnValueFilter work well when only a few columns are filtered. But I want to run time-range queries, for example all data between 10 am and 11 am for a particular ID.

Please let me know how to achieve this, or whether there is a better way to do something similar with other technologies in the open-source big data stack.

Thanks

score 0 · Answer 1 · edited Oct 15 '17 at 19:36

HBase performs well with a small number of column families and any number of columns for seeks. If the schema is well designed, you can also do range scans very efficiently, without needing filters and the inefficiency they introduce.

If you want to query a particular ID, making it the rowkey is a good idea. But using timestamps as column names, as you suggested, is not a good idea, because a range over column names cannot be expressed as a plain scan range the way a range over rowkeys can.

However, in this situation you can use the following approach:

rowKey (timestamp + ID, as byte array)   column1   column2   ...

10.1ID1                                  1000      100       ...
10.1ID2                                  100       1000      ...
10.2ID1                                  10        100       ...
10.2ID2                                  5         20        ...

(column1 holds counters, which are very good for highly concurrent data aggregation.)

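Now if you want to scan a particular time range (say 10 to 11), you can do a scan with a partial start rowkey (10.0) and a partial end rowkey (10.9) for all IDs. For one particular ID (say ID1), you can use 10.0ID1 as the start rowkey and 10.9ID1 as the end rowkey. Because rows are sorted lexicographically, that scan still passes over rows of other IDs that fall between the two keys (10.1ID2, for example), so those have to be filtered out on the ID part of the key.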
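To make the ordering concrete, here is a minimal sketch in plain Python that imitates an HBase scan over lexicographically sorted rowkeys. It does not use the HBase client; the `scan` helper, the extra `11.0ID1` row, and the column names are assumptions for illustration only. The stop row is treated as exclusive, which is how an HBase scan behaves by default.

```python
from bisect import bisect_left

# Toy stand-in for an HBase table: rows are ordered by rowkey bytes.
rows = {
    b"10.1ID1": {"column1": 1000, "column2": 100},
    b"10.1ID2": {"column1": 100, "column2": 1000},
    b"10.2ID1": {"column1": 10, "column2": 100},
    b"10.2ID2": {"column1": 5, "column2": 20},
    b"11.0ID1": {"column1": 7, "column2": 70},  # hypothetical row outside 10.x
}
sorted_keys = sorted(rows)


def scan(start_row: bytes, stop_row: bytes) -> list[bytes]:
    """Return rowkeys with start_row <= key < stop_row (stop row exclusive)."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


# Partial keys select the whole time range for every ID.
all_ids = scan(b"10.0", b"10.9")

# Bounding the scan with ID1 on both ends still walks over ID2 rows,
# because 10.0ID1 < 10.1ID2 < 10.9ID1 in byte order.
bounded = scan(b"10.0ID1", b"10.9ID1")
id1_only = [key for key in bounded if key.endswith(b"ID1")]
```

With timestamp-first keys the time range is cheap to select, but narrowing to a single ID needs an extra check on the key suffix, as the last line shows.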

If you want to scan for a range of IDs, then it would be better to put the ID first in the rowKey, followed by the timestamp.
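A rough sketch of an ID-first key, again in plain Python rather than the HBase client. The `make_key` layout (fixed-width ID, a `|` separator, then the timestamp), the timestamp strings, and the mapping of the question's sample values onto them are all assumptions for illustration.

```python
from bisect import bisect_left


def make_key(row_id: str, ts: str) -> bytes:
    # Assumed layout: fixed-width ID, then "|", then the timestamp.
    return f"{row_id}|{ts}".encode()


# Sample values from the question, placed on hypothetical timestamps.
readings = {
    "101003978854": {"10.1": 10.1, "10.2": 34.2, "11.0": 23.5},
    "101003998120": {"10.1": 21.19, "10.2": 15.09, "11.0": 13.24},
    "109721347573": {"10.1": 13.76, "10.2": 26.8, "11.0": 10.09},
}
table = {
    make_key(row_id, ts): value
    for row_id, series in readings.items()
    for ts, value in series.items()
}
sorted_keys = sorted(table)


def scan(start_row: bytes, stop_row: bytes) -> list[bytes]:
    """Return rowkeys with start_row <= key < stop_row (stop row exclusive)."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


# One ID, 10.0 up to (not including) 10.9: only that ID's rows come back.
one_id = scan(make_key("101003978854", "10.0"), make_key("101003978854", "10.9"))

# A range of IDs: everything from the first ID up to, but excluding, the third.
id_range = scan(b"101003978854", b"109721347573")
```

With this layout all rows of one ID are contiguous, so the time range for a particular ID is a single bounded scan with no filtering. The fixed-width ID matters: with variable-width IDs, one ID could be a prefix of another and the ranges would overlap.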

Keep the number of columns small if you want to filter the scan results. Also, to keep the number of rows low (since scans are intended), store the timestamp at hour, day, or month granularity, whichever suits your requirement.
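As a small illustration of choosing a granularity, the sketch below truncates a timestamp to the hour, day, or month before it goes into the rowkey. The format strings and the `bucket` helper are assumptions, not something HBase prescribes; the only requirement is a fixed-width format so that byte order matches chronological order.

```python
from datetime import datetime

# Hypothetical fixed-width formats, one per granularity.
GRANULARITY = {"hour": "%Y%m%d%H", "day": "%Y%m%d", "month": "%Y%m"}


def bucket(ts: datetime, granularity: str) -> str:
    """Truncate a timestamp to the chosen granularity for use in a rowkey."""
    return ts.strftime(GRANULARITY[granularity])


reading_time = datetime(2017, 10, 15, 10, 37)
hour_key = bucket(reading_time, "hour")    # "2017101510"
day_key = bucket(reading_time, "day")      # "20171015"
month_key = bucket(reading_time, "month")  # "201710"
```

Every reading taken between 10:00 and 10:59 then lands in the same row, so a 10 am to 11 am query touches one row per ID instead of one row per reading.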

For scans, it is also best to distribute the data evenly across the cluster nodes, so that scans run in parallel across the regions and finish faster. See HBase's pre-split key strategy.

HBase works very well with a good schema and rowkey design, and having used alternatives for similar use cases, I can say it is one of the best options out there.
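A hedged sketch of what choosing pre-split keys could look like for ID-first rowkeys, assuming fixed-width 12-digit decimal IDs like the ones in the question. The `split_points` and `region_for` helpers are hypothetical; in a real cluster the split keys would be supplied when the table is created, and HBase decides which region server hosts each region.

```python
from bisect import bisect_right


def split_points(num_regions: int, id_width: int = 12) -> list[bytes]:
    """Evenly spaced split keys over a fixed-width decimal ID space."""
    space = 10 ** id_width
    return [
        str(space * i // num_regions).zfill(id_width).encode()
        for i in range(1, num_regions)
    ]


def region_for(rowkey: bytes, splits: list[bytes]) -> int:
    """Index of the region whose key range [start, end) contains rowkey."""
    return bisect_right(splits, rowkey)


splits = split_points(4)
# splits == [b"250000000000", b"500000000000", b"750000000000"]

first_region = region_for(b"101003978854|10.1", splits)
```

Evenly spaced split points only balance the load if the IDs themselves are spread evenly. The sample IDs in the question all start with "10", so they would all fall into the first region here; in practice the split points should be derived from the actual key distribution.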
