I'm thinking of using HBase to store web log data. Each log line would have about 20 different values (let's say columns), and I want to run queries that filter results based on those columns.
My initial idea was to save each log line (cell) multiple times, once per field, under a column whose name is the value of that field. This would increase the data size about 20x, but I think it would give a good increase in query performance. The row key would be the timestamp prefixed with the source id.
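To make the row-key idea concrete, here is a minimal sketch in Python. The fixed widths (a 4-byte source id and an 8-byte millisecond timestamp) and the helper names `make_row_key` and `scan_range` are assumptions for illustration only; HBase itself just sees an opaque byte array and sorts rows by comparing those bytes.

```python
import struct

# Assumed layout (hypothetical): 4-byte big-endian source id followed by an
# 8-byte big-endian timestamp in milliseconds. Fixed-width big-endian unsigned
# fields compare byte-wise in the same order as their numeric values, which
# matches the lexicographic ordering HBase applies to row keys.

def make_row_key(source_id: int, ts_millis: int) -> bytes:
    """Build a row key: source id prefix, then timestamp."""
    return struct.pack(">IQ", source_id, ts_millis)

def scan_range(source_id: int, start_ms: int, stop_ms: int):
    """Start key (inclusive) and stop key (exclusive) for one source's time window."""
    return make_row_key(source_id, start_ms), make_row_key(source_id, stop_ms)

# Rows for the same source end up adjacent and ordered by time.
keys = [make_row_key(7, 2000), make_row_key(7, 1000), make_row_key(3, 9000)]
sorted_keys = sorted(keys)
```

With a key like this, reading one source over a time window is a single contiguous scan between the two keys returned by `scan_range`, while filtering on any of the other fields is what the duplicated copies are meant to speed up.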
Each source will generate about 40-100M log lines (there might be tens of thousands of sources).
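A rough back-of-envelope calculation of what the 20x duplication means at this scale is sketched below. The source count of 10,000 is an assumption (the low end of "tens of thousands"), and `total_entries` is just a helper name for this sketch.

```python
# Figures from above: 40-100M log lines per source, about 20 fields per line.
# ASSUMPTION: 10,000 sources, the low end of "tens of thousands".
SOURCES = 10_000
LINES_PER_SOURCE = (40_000_000, 100_000_000)
COPIES_PER_LINE = 20  # one stored copy per field

def total_entries(sources: int, lines_per_source: int, copies: int) -> int:
    """Number of stored entries if every log line is written once per field."""
    return sources * lines_per_source * copies

low = total_entries(SOURCES, LINES_PER_SOURCE[0], COPIES_PER_LINE)
high = total_entries(SOURCES, LINES_PER_SOURCE[1], COPIES_PER_LINE)
print(f"{low:.1e} to {high:.1e} stored entries")
```

Under that assumption the total lands somewhere between 8 and 20 trillion stored entries, so the duplication factor matters a lot for storage and write throughput.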
I also need low latency, possibly below 10 seconds (so solutions like Hive are currently not an option).
Do you think this is the right schema design? If not, what do you think would be the right one? Or maybe I should use something else (what)?
Thanks for all your answers.