
I would like to ask whether the current schema design of an HBase table is suitable for the following scenario: I receive 10 million events per day, each having a unix epoch timestamp and an id. I have to group events by day so that I can easily scan for those that happened on a specific day.

Current design: Each event's timestamp is converted to a string of the form "MM-YYYY_DD", which is used as the row key, and the id of each event that occurred on that day is stored as a column in that row. This results in up to 10 million columns in a single row. As far as I understand HBase, there is a lock when writing to a single row, so importing a single day would cause heavy lock contention and decrease performance.
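For illustration, this is roughly what a single ingest looks like under this design; the table name "events" and column family "e" are just placeholders I'm making up here:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DayRowIngest {
    // One row per day: the day string is the row key, and every event id
    // becomes a column qualifier in that row.
    static void writeEvent(Connection conn, long epochSeconds, String eventId)
            throws Exception {
        String day = new SimpleDateFormat("MM-yyyy_dd")
                .format(new Date(epochSeconds * 1000L));
        try (Table table = conn.getTable(TableName.valueOf("events"))) {
            Put put = new Put(Bytes.toBytes(day));      // row key = "MM-YYYY_DD"
            put.addColumn(Bytes.toBytes("e"),           // column family "e"
                    Bytes.toBytes(eventId),             // qualifier = event id
                    Bytes.toBytes(epochSeconds));       // value = raw timestamp
            table.put(put);  // all writes for one day contend on this single row
        }
    }
}
```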

Maybe this would be a better design: use the unix epoch timestamp as the row key, resulting in many rows with several thousand columns each (several events may occur in the same second, because my timestamps have a maximum resolution of one second). When scanning, one can calculate the start and end of the day in unix epoch time and scan that range.
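The day scan under this alternative would then look roughly like this, assuming the row key is the 8-byte big-endian encoding of the epoch seconds (so keys sort chronologically); table and family names are again placeholders:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EpochRowScan {
    // Row key = Bytes.toBytes(epochSeconds); scan [dayStart, dayStart + 86400).
    static void scanDay(Connection conn, long dayStartEpochSeconds) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("events"))) {
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes(dayStartEpochSeconds));          // inclusive
            scan.setStopRow(Bytes.toBytes(dayStartEpochSeconds + 86400L));  // exclusive
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // each qualifier in this row is one event id for that second
                }
            }
        }
    }
}
```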

Matthias Mueller

2 Answers


I'll just list some facts about HBase; they might be useful for deciding how to amend your design.

HBase is a column-oriented distributed database. It distributes records across different nodes based on the prefix of the row key. So, depending on how many nodes you have, your case will work as follows: records for different months will go to different nodes, and all data for all days of a specific month will go to a single node.

At the same time, it's fine to have a long row key (with an event-id suffix), which most likely will not affect distribution much. HBase allows you to build scan queries based on a prefix of the row key rather than requiring an exact match.
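For example, with a composite key like "MM-YYYY_DD" + event id, one day can be read with a prefix scan, roughly like this (table and family names are placeholders of mine):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixScanExample {
    // Row key = day string + event id suffix; match every row for one day.
    static void scanOneDay(Connection conn, String day) throws Exception {  // e.g. "05-2017_25"
        try (Table table = conn.getTable(TableName.valueOf("events"))) {
            Scan scan = new Scan();
            scan.setRowPrefixFilter(Bytes.toBytes(day));  // prefix match, no exact key needed
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    String key = Bytes.toString(row.getRow());
                    // the event id is the part of the key after the day prefix
                }
            }
        }
    }
}
```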

vvg
  • Thanks for the hint about distribution. Maybe a combination of both keys would make sense, so that accessing one day will not require contacting every single region server. – Matthias Mueller May 25 '17 at 18:21

HBase is best used for fast random reads and writes. For anything other than that, you have to pay extra attention. In your case, keeping the day as the row key is very bad because, as you said, it will result in millions of columns. That is not good practice; you will most likely run into memory issues when holding such large rows.

If you want grouping/partitioning, then a scan with a filter is not a bad approach. You can query based on a column value with a SingleColumnValueFilter, though performance will not be optimal compared to a row-key scan. Then again, I am not sure what response time you are expecting.
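A rough sketch of such a filter scan; the column family "e" and qualifier "day" are just example names, not anything your schema prescribes:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterScanExample {
    // Scans the table, keeping only rows whose "e:day" column equals the wanted day.
    static void scanByDayColumn(Connection conn, String day) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("events"))) {
            SingleColumnValueFilter filter = new SingleColumnValueFilter(
                    Bytes.toBytes("e"),              // column family
                    Bytes.toBytes("day"),            // qualifier holding e.g. "20170601"
                    CompareFilter.CompareOp.EQUAL,
                    Bytes.toBytes(day));
            filter.setFilterIfMissing(true);         // drop rows lacking the column
            Scan scan = new Scan();
            scan.setFilter(filter);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // matching events; note this still touches every region
                }
            }
        }
    }
}
```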

Ramzy
  • But doing a row scan over several regions will result in poor performance too, won't it? – Matthias Mueller May 31 '17 at 19:41
  • Yes, as we know, HBase is a columnar database, so once again it works best if we scan only a few columns rather than all of them. – Ramzy May 31 '17 at 20:18
  • But as you can read here https://www.quora.com/Is-there-a-limit-to-the-number-of-columns-in-an-HBase-row, there is no limit on the number of columns, nor any known problem except the lock when writing to that row. So why do you call having many columns "bad practice"? We write to a row only during ingestion/import and otherwise read it often; I think this would be acceptable. – Matthias Mueller Jun 01 '17 at 06:43
  • While designing the schema, you need to choose between tall (many rows) and wide (many columns) tables; it depends on the use case. The reason you chose a wide table does not seem to suit your reading requirements. As I mentioned, along with the lock, you can also end up with memory issues while trying to scan an entire row. These are just a few of the problems. If a row has many attributes, it makes sense to have many columns. But for querying 10 million events per day, a wide schema doesn't seem to help, as there are other ways. – Ramzy Jun 01 '17 at 12:11
  • Thanks @Ramzy, that helped! And what about the idea of using the epoch timestamp of each event, resulting in a many-rows schema? There I could easily scan for a whole day too. – Matthias Mueller Jun 01 '17 at 15:42
  • That will result in a hotspotting issue in HBase (caused by a monotonically increasing value being used as the row key). You can include the timestamp as part of the row key, salt it to avoid hotspotting, and then use a prefix scan; that's one possibility (see the sketch after these comments). The other is to scan on a specific column value that holds the date (or day, like 20170601) using a filter. – Ramzy Jun 01 '17 at 16:20
  • Let me add an article about HBase hot spotting: https://sematext.com/blog/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/ – Matthias Mueller Jun 02 '17 at 04:45
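To make the salting suggestion from the comments above concrete, here is a minimal sketch assuming a fixed number of one-byte salt buckets; the bucket count and the table/family names are assumptions, not anything HBase prescribes:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeyExample {
    static final int BUCKETS = 16;  // assumed bucket count; pick per cluster size

    // Write side: prefix the timestamp key with a salt byte derived from the event id,
    // so monotonically increasing timestamps spread over BUCKETS regions instead of one.
    static byte[] saltedKey(long epochSeconds, String eventId) {
        byte salt = (byte) ((eventId.hashCode() & 0x7fffffff) % BUCKETS);
        return Bytes.add(new byte[] { salt },
                Bytes.toBytes(epochSeconds), Bytes.toBytes(eventId));
    }

    // Read side: one day now needs one range scan per bucket, merged client-side.
    static void scanDayAllBuckets(Connection conn, long dayStart) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("events"))) {
            for (int b = 0; b < BUCKETS; b++) {
                Scan scan = new Scan();
                scan.setStartRow(Bytes.add(new byte[] { (byte) b },
                        Bytes.toBytes(dayStart)));
                scan.setStopRow(Bytes.add(new byte[] { (byte) b },
                        Bytes.toBytes(dayStart + 86400L)));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result row : scanner) {
                        // events for this salt bucket
                    }
                }
            }
        }
    }
}
```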