We are looking at using HBase for real-time analytics.
Before the data reaches HBase, we will run a Hadoop MapReduce job over our log files to aggregate the data, and store the fine-grained aggregate results in HBase so that we can run real-time analytics and queries against them. In other words, the HBase tables will hold pre-aggregated data (by date).
My question is: how do we best design the schema and row keys for these HBase tables to enable fast but flexible queries?
For example, assume that the raw log lines contain the following fields:
timestamp, client_ip, url, referrer, useragent
and say our MapReduce job produces three different aggregate outputs, each of which we want to store in a separate "table" (HBase column family), as sketched below:
- date, operating_system, browser
- date, url, referrer
- date, url, country
(Our MapReduce job derives the operating_system, browser and country fields from the useragent and client_ip data.)
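For concreteness, this is roughly the layout I have in mind: one table with one column family per aggregate output. The sketch below assumes the HBase 2.x Java client; the table name log_aggregates and the family names are just placeholders, and three separate tables with one family each would be the alternative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateAggregateTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // One column family per pre-aggregated output, so each aggregate
            // can be scanned (and configured) independently.
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("log_aggregates"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("os_browser"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("url_referrer"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("url_country"))
                    .build());
        }
    }
}
```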
Specifically: how can we structure the HBase schema to allow fast, near-real-time and flexible lookups on any of these fields, or on a combination of them? For instance, a user must be able to ask for:
- operating_system by date ("How many iPad users in this date range?")
- url by country and date ("How many users to this url from this country for the last month?")
and basically any other custom query?
Should we use row keys like these:
- date_os_browser
- date_url_referrer
- date_url_country
and, if so, can we satisfy the kinds of queries described above?
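If the row keys are built that way, with the date first in a sortable form like yyyyMMdd, then a date range turns into a plain start/stop-row scan, and a constraint on a non-leading field such as operating_system has to be applied as a filter on top of that scan. Here is a minimal sketch of the first example query ("How many iPad users in this date range?") against date_os_browser keys, again assuming the HBase 2.x Java client, the placeholder table log_aggregates with an os_browser family, and a count column holding a long that the MapReduce job wrote; none of those names come from our actual setup:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class OsByDateRangeQuery {

    // Placeholder family/qualifier names, not an established convention.
    private static final byte[] FAMILY = Bytes.toBytes("os_browser");
    private static final byte[] COUNT  = Bytes.toBytes("count");

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("log_aggregates"))) {

            // How the MapReduce job could store one pre-aggregated row:
            // row key = yyyyMMdd_operatingSystem_browser, value = daily count.
            // (A real key needs a separator that cannot occur in the values.)
            Put put = new Put(Bytes.toBytes("20240301_iPad_Safari"));
            put.addColumn(FAMILY, COUNT, Bytes.toBytes(1234L));
            table.put(put);

            // "How many iPad users between 2024-03-01 and 2024-03-31?"
            // The date is the leading key component, so the date range maps
            // directly onto start/stop rows (the stop row is exclusive).
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("20240301"))
                    .withStopRow(Bytes.toBytes("20240401"))
                    .addFamily(FAMILY);

            long total = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // Key layout: date _ operating_system _ browser.
                    String[] parts = Bytes.toString(row.getRow()).split("_", 3);
                    if (parts.length < 3 || !"iPad".equals(parts[1])) {
                        continue; // OS is not the key prefix, so filter it here
                    }
                    byte[] count = row.getValue(FAMILY, COUNT);
                    if (count != null) {
                        total += Bytes.toLong(count);
                    }
                }
            }
            System.out.println("iPad users in March 2024: " + total);
        }
    }
}
```

The "url by country and date" example would run the same way against the date_url_country keys. The general pattern seems to be: predicates on the leading key components stay cheap range scans, while predicates on trailing components need a filter (client-side as above, or a server-side RowFilter / FuzzyRowFilter) or a second table keyed in the other order.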