I'm working on an event processing system where I have to read my event data from an HBase table. The events are stored keyed by their timestamp. When I read in a whole day (24 hours), I find periods of the day with around 1 million events per hour (e.g. during regular business hours) and other periods where I only get several thousand. So when I partition the day into equal slices, some partitions (and workers) get a lot of work and some get very little. Is there a concept for partitioning the day so that in the off time each partition covers more hours, and for the main hours fewer? This would result in something like:

* from 0-6am use 4 partitions
* from 6am to 6pm use 60 partitions
* from 6pm to 12am use 6 partitions
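A rough sketch of what I mean, assuming Spark-style processing and millisecond epoch timestamps (`partitionsFor` is a hypothetical helper; the bucket boundaries are the ones above):

```scala
import java.time.{LocalDate, ZoneOffset}

// (startHour, endHour, partitions): the uneven buckets from above.
val buckets = Seq((0, 6, 4), (6, 18, 60), (18, 24, 6))

// Split one day into (startMillis, endMillis) ranges, cutting the busy
// hours into many small slices and the quiet hours into a few wide ones.
def partitionsFor(day: LocalDate): Seq[(Long, Long)] = {
  val midnight = day.atStartOfDay.toInstant(ZoneOffset.UTC).toEpochMilli
  val msPerHour = 3600L * 1000L
  buckets.flatMap { case (from, to, n) =>
    val start = midnight + from * msPerHour
    val width = (to - from) * msPerHour / n
    (0 until n).map(i => (start + i * width, start + (i + 1) * width))
  }
}
```

Each (start, end) range would then back one scan and one worker partition, e.g. `partitionsFor(LocalDate.of(2017, 7, 4))` yields 70 ranges.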
1 Answer
If you just use the timestamp as the row key, this means you already have problems with region hot spotting, even before any processing. A simple solution is to add a sharding key before the timestamp:

Row key = (timestamp % number of regions) + timestamp

where + is concatenation. This will distribute rows evenly across regions.
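A minimal sketch of that key layout (numRegions = 16 is an assumption and must match the table's pre-split regions; `rowKey` is a hypothetical helper, not an HBase API):

```scala
import java.nio.ByteBuffer

val numRegions = 16

// One salt byte derived from the timestamp, followed by the
// 8-byte big-endian timestamp itself.
def rowKey(timestamp: Long): Array[Byte] = {
  val salt = (timestamp % numRegions).toByte
  salt +: ByteBuffer.allocate(8).putLong(timestamp).array()
}
```

Note that reading a time range back then takes one scan per salt value, because consecutive timestamps land in different regions.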

gorros
- Thanks @gorros for the hint, but I already use a salt to achieve a good distribution over all region servers. – Matthias Mueller Jul 04 '17 at 20:36
- Then, as I understand from the data locality principle, the number of executors must be at least equal to the number of region servers. And since the data is distributed evenly between servers, each executor will get the same amount. Also, the number of partitions is equal to the number of regions, as I remember. – gorros Jul 05 '17 at 03:30
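A minimal sketch of that sizing advice (the region-server count of 16 is assumed; `spark.executor.instances` is the standard Spark setting for a fixed executor count):

```scala
import org.apache.spark.SparkConf

// Request at least as many executors as there are HBase region servers
// (16 is an assumed count), so each server has a local executor.
val conf = new SparkConf()
  .setAppName("event-processing")
  .set("spark.executor.instances", "16")
```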