Say we were modeling Users and Friends, and Friends have a type.

We could model it in Oracle like:

User: id, name, sex, age
Friendship: user_id, friend_id, type

So in HBase, we could do:

(this first model is from here, which is recommended by the HBase FAQ)

Table: Users
RowKey = <user_id>
Column Family = Info; Columns = "Name", "Sex", "Age"
Column Family = Friend; Columns = "Friend:<user_id>"=type

(where "Friend:<user_id>"=type could be repeated for one or more user_ids)

or

Table: Users
RowKey = <user_id>
Column Family = Info; Columns = "Name", "Sex", "Age", "Friends"

(where "Friends" is a JSON string in the form [{user_id:<user_id>, type:<type>}, ...])
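
To make the two layouts concrete, here is a rough sketch that stores the same user both ways, using plain Python dictionaries as a stand-in for an HBase row. This is not the HBase client API, and the sample ids, names, and friend types are made up for illustration.

```python
import json

# Model 1: one column per friend in a separate "Friend" column family.
# Layout: column family -> {qualifier -> value}
row_model_1 = {
    "Info": {"Name": "Alice", "Sex": "F", "Age": "30"},
    "Friend": {"42": "coworker", "77": "family"},  # qualifier = friend's user_id
}

# Model 2: a single "Friends" column holding a JSON string.
row_model_2 = {
    "Info": {
        "Name": "Alice",
        "Sex": "F",
        "Age": "30",
        "Friends": json.dumps([
            {"user_id": "42", "type": "coworker"},
            {"user_id": "77", "type": "family"},
        ]),
    },
}


def friends_model_1(row):
    """Return {friend_id: type} straight from the Friend family's columns."""
    return dict(row.get("Friend", {}))


def friends_model_2(row):
    """Return {friend_id: type} by parsing the JSON stored in Info:Friends."""
    entries = json.loads(row["Info"].get("Friends", "[]"))
    return {entry["user_id"]: entry["type"] for entry in entries}
```

Both functions return the same mapping; the difference is that the second one has to parse the whole list to get at any single friend.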

However, if a friend did not have a type, the second model could simply be [user_id:<user_id>, ...]. What would the first model do if friends didn't have a type?

What are the pros and cons of either approach?

Matthew Moisen

2 Answers

One column with a list of values breaks normalization rules. If you don't know what those are or why they're important, please do some research.

I don't think either model is correct. A one-to-many relationship ought to be modeled correctly, and both of your schemas break normalization rules.

duffymo

It really depends on how many friends you have and what your read and write access pattern is.

In the first case, with a friend per column, you can add a friend without reading all of the other friends. However, each friend is then stored as its own cell with its own timestamp, which increases the total storage requirement per friend.
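
A minimal sketch of the two write paths, assuming rows are modeled as plain Python dictionaries whose cells are `(value, timestamp)` tuples. This only mimics the shape of the data; the function names and sample values are hypothetical, and none of this is the HBase client API.

```python
import json


def add_friend_per_column(row, friend_id, friend_type, ts):
    """Model 1: write one new cell; existing friend cells are never read."""
    # Every friend cell carries its own timestamp, which is the storage cost.
    row.setdefault("Friend", {})[friend_id] = (friend_type, ts)


def add_friend_json(row, friend_id, friend_type, ts):
    """Model 2: read the whole list, modify it, and write it all back."""
    old_value, _old_ts = row["Info"].get("Friends", ("[]", None))
    friends = json.loads(old_value)
    friends.append({"user_id": friend_id, "type": friend_type})
    # The whole list lives in one cell, so there is only one timestamp.
    row["Info"]["Friends"] = (json.dumps(friends), ts)


wide = {"Info": {}, "Friend": {"42": ("coworker", 1)}}
blob = {"Info": {"Friends": ('[{"user_id": "42", "type": "coworker"}]', 1)}}

add_friend_per_column(wide, "77", "family", ts=2)
add_friend_json(blob, "77", "family", ts=2)
```

The JSON version is a read-modify-write, so two writers adding friends at the same time could overwrite each other unless the update is guarded in some way.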

Also, if you don't always read the friends when you read a user, the first case doesn't require you to load the friends. You can do a single column family scan and avoid all the extra IO.
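
Roughly, a family-restricted read looks like the sketch below. It uses a plain Python dictionary as a stand-in for a row, and `read_row` is a hypothetical helper, not an HBase call. In the Java client the equivalent would be, as far as I know, restricting a `Get` or `Scan` with `addFamily`.

```python
def read_row(row, families=None):
    """Return only the requested column families (all of them if None)."""
    if families is None:
        families = list(row)
    # Families that were not requested are never touched.
    return {fam: dict(row[fam]) for fam in families if fam in row}


user_row = {
    "Info": {"Name": "Alice", "Sex": "F", "Age": "30"},
    "Friend": {"42": "coworker", "77": "family"},
}

info_only = read_row(user_row, families=["Info"])
```

In the second model the friends live inside the Info family, so there is no way to leave them out of a read like this.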

The downside of more column families is that each one has its own MemStore in every region, so your regions need more memory. It also means more non-sequential disk writes, because each column family is flushed separately.
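
As a back-of-the-envelope illustration of that cost, the count simply scales with the number of families. The region count here is a made-up number, and `memstore_count` is a hypothetical helper.

```python
def memstore_count(regions, column_families):
    """One MemStore per column family per region."""
    return regions * column_families


# Hypothetical: 100 regions of the Users table hosted on one region server.
one_family = memstore_count(regions=100, column_families=1)    # Info only
two_families = memstore_count(regions=100, column_families=2)  # Info + Friend
```

So splitting friends into their own family doubles the number of MemStores (and separate flushes) for this table.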

b4hand