HBase - What are the pros and cons of using one column with a list of values vs using one column family with a list of columns?

Question

Say we were modeling Users and Friends, and Friends have a type.

We could model it in Oracle like:

User: id, name, sex, age
Friendship: user_id, friend_id, type

So in HBase, we could do:

(this first model is from here, which is recommended by the HBase FAQ)

Table: Users
RowKey = <user_id>
Column Family = Info; Columns = "Name", "Sex", "Age"
Column Family = Friend; Columns = "Friend:<user_id>"=type

(where "Friend:<user_id>"=type could be repeated for one or more user_ids)

or

Table: Users
RowKey = <user_id>
Column Family = Info; Columns = "Name", "Sex", "Age", "Friends"

(where "Friends" is a JSON string in the form [{user_id:, type:}, ...])

However, if a friend did not have a type, the second model could simply be [user_id:<user_id>, ...]. What would the first model do if friends didn't have a type?

What are the pros and cons of either approach?

score 0 · Answer 1 · answered Mar 20 '14 at 18:45


One column with a list of values breaks normalization rules. If you don't know what those are or why they're important, please do some research.

I don't think either model is correct. A one to many relationship ought to be modeled correctly. Both your schemas break normalization rules.


duffymo


This is for HBase, not the relational database. I've edited for clarity. – Matthew Moisen Mar 20 '14 at 18:46

score 0 · Answer 2 · answered Mar 20 '14 at 23:54

It really depends on how many friends you have and what your read and write access pattern is.

In the first case, with one column per friend, you can add a friend without reading all of the other friends. However, each friend becomes its own cell with its own timestamp (HBase also stores the row key, column family, and qualifier with every cell), which increases the total storage required per friend.
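To make the write-path difference concrete, here is a minimal sketch that models a single row as a plain Python dict. It is not the HBase client API; `new_row`, `put`, `add_friend_wide`, and `add_friend_json` are hypothetical helpers, and the friend ids and types are made up for illustration.

```python
import json
import time

# Hypothetical in-memory stand-in for one HBase row:
# {column_family: {qualifier: (timestamp_ms, value)}}. Not the HBase client API.
def new_row():
    return {"Info": {}, "Friend": {}}

def put(row, family, qualifier, value):
    """Write one cell; as in HBase, every cell carries its own timestamp."""
    row[family][qualifier] = (int(time.time() * 1000), value)

def add_friend_wide(row, friend_id, friend_type):
    """Model 1: one column per friend. A blind write, no prior read needed."""
    put(row, "Friend", friend_id, friend_type)

def add_friend_json(row, friend_id, friend_type):
    """Model 2: one JSON column. Read, modify, and rewrite the whole list."""
    _, raw = row["Info"].get("Friends", (0, "[]"))
    friends = json.loads(raw)
    friends.append({"user_id": friend_id, "type": friend_type})
    put(row, "Info", "Friends", json.dumps(friends))

wide, blob = new_row(), new_row()
for fid, ftype in [("u2", "family"), ("u3", "coworker")]:
    add_friend_wide(wide, fid, ftype)
    add_friend_json(blob, fid, ftype)

wide_cells = len(wide["Friend"])   # one cell (and one timestamp) per friend
blob_cells = len(blob["Info"])     # a single cell holding every friend
blob_friends = json.loads(blob["Info"]["Friends"][1])
```

In this sketch the wide layout ends up with one cell per friend, while the JSON layout keeps a single cell but has to deserialize and rewrite the full list on every change.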

Also, if you don't always read the friends when you read a user, the first case doesn't require you to load the friends. You can do a single column family scan and avoid all the extra IO.
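As a rough illustration of that point, the sketch below again uses a plain dict in place of a real table. The `get` helper is hypothetical and only mimics the effect of restricting a read to one column family (in the Java client this is what `Get.addFamily` or `Scan.addFamily` is for); the row contents are invented.

```python
# Hypothetical in-memory row: {column_family: {qualifier: value}}.
# Not the HBase client API; the data is made up for illustration.
row = {
    "Info": {"Name": "Alice", "Sex": "F", "Age": "30"},
    "Friend": {f"u{i}": "friend" for i in range(1000)},
}

def get(row, families):
    """Return only the requested column families, like a family-restricted read."""
    return {fam: dict(row[fam]) for fam in families}

profile = get(row, ["Info"])
cells_read = sum(len(cols) for cols in profile.values())
```

Here the profile read touches only the three Info cells and never the thousand Friend cells. In the second model the "Friends" JSON sits in the Info family, so it shares that family's store files with the profile columns.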

The downside of more column families is that you have more MemStores, so more memory is required for your regions. It also means more non-sequential disk flushing, as each column family is flushed to disk separately.
