I'm building HBase storage for data generated by different sources. Columns from the same source are usually retrieved together. The expected write/read ratio ranges roughly from 1/10 to 1/100, depending on the source.
So I have two choices:
- Multiple column families: create a single table with multiple column families, where the data from each source forms one column family.
- Multiple tables: create one table (with one column family) for each source.
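To make the two layouts concrete, here is a rough Python sketch that models them as nested dictionaries. It is only a toy in-memory model: the function names, the `user42` rowkey, and the `profile`/`clicks` sources are made up for illustration and are not HBase API.

```python
from collections import defaultdict

# Toy in-memory model of the two layouts; nothing here is HBase client code.

def make_store():
    # Two levels of auto-created dicts, with a plain dict at the leaves.
    return defaultdict(lambda: defaultdict(dict))

def put_single(table, rowkey, source, qualifier, value):
    # Option 1: rowkey -> column family (one per source) -> qualifier -> value
    table[rowkey][source][qualifier] = value

def put_multi(tables, rowkey, source, qualifier, value):
    # Option 2: table name (one per source) -> rowkey -> qualifier -> value
    tables[source][rowkey][qualifier] = value

single = make_store()
multi = make_store()
for put, store in ((put_single, single), (put_multi, multi)):
    put(store, "user42", "profile", "name", "alice")
    put(store, "user42", "clicks", "count", 7)
```

In the first layout the source is a level inside each row; in the second the source selects the table before the rowkey is looked up.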
Here is my understanding so far; please correct me if anything is wrong.
- The multiple-tables solution works well for adding new sources dynamically, while the multiple-column-families solution may involve downtime when a new column family is added.
- If the rowkeys of different sources have different distributions (for example, int user_id vs. image GUID) or cardinalities, maybe it's better to split them into different tables?
- We may need to retrieve columns from different sources for the same rowkey at the same time. In that case, multiple column families may be faster (not sure)?
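The intuition behind the last point can be sketched with a toy Python model that counts row lookups when one rowkey is read from several sources. The data and function names are hypothetical, and this is not HBase client code; whether fewer lookups actually means faster reads in a real cluster is exactly what I'm unsure about.

```python
# Toy model: count the lookups needed to fetch one rowkey from several sources.
# Hypothetical data and names; this does not call HBase.

single_table = {"user42": {"profile": {"name": "alice"}, "clicks": {"count": 7}}}
multi_tables = {
    "profile": {"user42": {"name": "alice"}},
    "clicks": {"user42": {"count": 7}},
}

def read_single(table, rowkey, sources):
    row = table.get(rowkey, {})  # one row lookup covers every requested family
    return {s: row.get(s, {}) for s in sources}, 1

def read_multi(tables, rowkey, sources):
    # one row lookup per source table
    result = {s: tables.get(s, {}).get(rowkey, {}) for s in sources}
    return result, len(sources)

wanted = ["profile", "clicks"]
single_result, single_lookups = read_single(single_table, "user42", wanted)
multi_result, multi_lookups = read_multi(multi_tables, "user42", wanted)
```

Both layouts return the same data here; the difference in this model is one lookup for the single table versus one per source for separate tables.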
Any suggestions, or are there other factors I should consider before making the decision? Are there typical cases where one of the two approaches (multiple tables or multiple column families) outperforms the other?
Thanks