22

Suppose I want to store relationships among the users of my application, similar to Facebook, per se.

That means if A is a friend(or some relation) of B, then B is also a friend of A. To store this relationships I am currently planning to store them in a table for relations as follows

  UID      FriendID
 ------    --------
 user1      user2
 user1      user3
 user2      user1

However I am facing two options here:

  1. The typical case, where I will store both user1 -> user2 and user2->user1. This will take more space, but (at least in my head) require just one pass over the rows to display the friends of a particular user.
  2. The other option would be to store either user1->user2 OR user2->user1 and whenever I want to find all the friends of user1, I will query on both columns of table to find a user's friends. It will take half the space but (again at least in my head) twice the amount of time.

First of all, is my reasoning appropriate? If yes, then are there any bottlenecks that I am forgetting (in terms of scaling / throughput or anything)?

Basically, are there any trade-offs between the two, other than the ones listed here. Also, in industry is one preferred over the other?

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Ankit
  • 6,772
  • 11
  • 48
  • 84

3 Answers3

24

Here is how these two approaches will be physically represented in the database:

enter image description here

Let us analyze both approaches...

Approach 1 (both directions stored in the table):

  • PRO: Simpler queries.
  • CON: Data can be corrupted by inserting/updating/deleting only one direction.
  • MINOR PRO: Doesn't require additional constraints to ensure a friendship cannot be duplicated.
  • Further analysis needed:
    1. TIE: One index covers both directions, so you don't need a secondary index.
    2. TIE: Storage requirements.
    3. TIE: Performance.

Approach 2 (only one direction stored in the table):

  • CON: More complicated queries.
  • PRO: Can't corrupt the data by forgetting to handle the opposite direction, since there is no opposite direction.
  • MINOR CON: Requires CHECK(UID < FriendID), so a same friendship can never be represented in two different ways, and the key on (UID, FriendID) can do its job.
  • Further analysis needed:
    1. TIE: Two indexes are necessary to cover both directions of querying (composite index on {UID, FriendID} and composite index on {FriendID, UID}).
    2. TIE: Storage requirements.
    3. TIE: Performance.

The point 1 is of special interest. MySQL/InnoDB always clusters data, and secondary indexes can be expensive in clustered tables (see "Disadvantages of clustering" in this article), so it might seem as if the secondary index in approach 2 would eat-up all the advantages of fewer rows. However, the secondary index contains the exact same fields as the primary (only in the opposite order) so there is no storage overhead in this particular case. There is also no pointer to table heap (since there is no table heap), so it's probably even cheaper storage-wise that a normal heap-based index. And assuming the query is covered with the index, there won't be a double-lookup normally associated with a secondary index in a clustered table either. So, this is basically a tie (neither approach 1 nor approach 2 has significant advantage).

The point 2 is related to the point 1: it doesn't matter whether we will have a B-Tree of N values or two B-Trees, each with N/2 values. So this is also a tie: both approaches will use-up approximately same amount of storage.

The same reasoning applies to point 3: whether we search one larger B-Tree or 2 smaller ones, doesn't make much of a difference, so this is also a tie.

So, for the robustness, and despite somewhat uglier queries and a need for additional CHECK, I'd go with the approach 2.

Branko Dimitrijevic
  • 50,809
  • 10
  • 93
  • 167
  • Approach one : CON : could be addressed by enforcing a constraint, but having to declare that constraint might be considered a design-level CON in itself, and its having to be checked might be considered a runtime CON in itself ... – Erwin Smout May 30 '12 at 14:54
  • 1
    Approach two : PRO : things are not that simple because in fact now a constraint is needed to PREVENT the opposite direction from appearing in the table (either through a declared constraint or, more typically, by enforcing some "canonical form" upon the row that does appear). That's your "minor CON", so how is this a PRO ? ... – Erwin Smout May 30 '12 at 14:57
  • Plus, in approach two, you're most likely pressed to define a view that "rebuilds approach 1 from approach 2" so approach 1 is available at least for read purposes ... It seems unlikely that read performance in this approach will be a tie to read performance in approach 1 ... – Erwin Smout May 30 '12 at 15:03
  • @ErwinSmout I don't understand your arguments. Comment 1: show me a declarative constraint that does that. Comment 2: that's what CHECK is for. Comment 3: I did acknowledge the need for more complicated querying (though I don't see any pressing need for a view), but the performance would be quite similar. _Why_ do you think it would be different? – Branko Dimitrijevic May 30 '12 at 15:51
  • "A declarative constraint that does that" : CREATE ASSERTION 0 = SELECT COUNT(*) FROM T where not exists (select 1 from t as t2 where t.x=t2.y and t.y=t2.x); And now you're going to tell me this answer doesn't count because the products don't support ASSERTION. But whether prodcuts support it or not doesn't matter. The logical rule is there and for the data to be correct, it must be enforced. Declarative or not. – Erwin Smout May 31 '12 at 06:42
  • Why I think read performance will always be slightly (not necessarily noticeable though - I did not claim that) worse : because regardless of whether some view is defined or the user is forced to write the UNION himself, answering a query "give me all friends of ..." will (almost necessarily) force the system to access distinct physical places in approach 2, which always takes longer (though not necessarily in noticeable ways), than answering the query in approach one, which will typically require access to a single physical location only. – Erwin Smout May 31 '12 at 06:52
  • @ErwinSmout _"But whether prodcuts support it or not doesn't matter."_ Well, it matters if you intend to ship your product. Also this kind of constraint could be quite expensive to enforce, even if it were supported. As for performance, I agree there might be a slight difference. That's why I used words like "doesn't make much of a difference" instead of "identical". Interestingly, on a DBMS that supports parallel querying, this difference could under certain conditions be slightly in favor of the approach 2. – Branko Dimitrijevic May 31 '12 at 07:28
  • "Also this kind of constraint could be quite expensive to enforce" suggests that data integrity is subordinate to performance ? "If I can't trust the answers, I don't care how quick I'm getting them." (C.J.Date). – Erwin Smout May 31 '12 at 10:26
  • @ErwinSmout Dude, stop putting words in my mouth. It doesn't suggest that at all. This is a simple factual statement, one that can influence the decision-making process between two alternatives discussed here. If the other alternative didn't offer integrity than you'd have a point. Since it does, you don't. – Branko Dimitrijevic May 31 '12 at 11:18
5

Storage is relatively cheap these days, so I would not worry about it because of that.

What would concern me is that you must now clean up as you are storing the information twice. So if you "unfriend" someone, you have to remove 2 records, not just one.

The other considerations are searches and indexing. There could be advantages of hashing the combination of 2 users ids to check for existance, provided you follow a consistant convention (like always append the higher id to the lower before hashing).

So now you have other possibilities. Are you interested on querying the relationship between the 2 users? Or is it more important to look at the attributes of one user?

These are concerns about what the system will do. Take a look at subjects like DDD (Domain Driven Design) and CQRS (Command Query Responsibility Segregation) to see how to divide up your app so each area is implemented in the simplest way possible. This will give you avenues to fine tune and optimize later without running into complexity issues.

Adam Dymitruk
  • 124,556
  • 26
  • 146
  • 141
1

While choosing between option 1 and option 2 in Branko Dimitrijevic's options you should consider this :

Is the relation you're trying to design symetrical or assymetrical ?

For example (bad example but still illustrates my point) if you're only trying to know whether both users are family or friends, then the link is symetrical. If one user is a member of the other's family then the inverse is true. Approach 2 might be considered.

But if you want more specific information like what type of family one person is to another (is it their father, son, uncle ?) then it becomes assymetrical. If A is B's father then B is A's son/daughter. Approach 1 might become necessary.