8

I want to store undirected graph edges (for example, for friends). To store and retrieve all friends of node a, one can use:

Create two rows per edge, query on one column per node:

+--------------------------+
| id | from_node | to_node |
+--------------------------+
| 1  |  a        |  b      |
| 2  |  b        |  a      |
+--------------------------+
SELECT * FROM `x` WHERE from_node = a

Create one row per edge, use OR:

+--------------------------+
| id | node_a    | node_b  |
+--------------------------+
| 1  |  a        |  b      |
+--------------------------+
SELECT * FROM `y` WHERE node_a = a OR node_b = a

Which makes for more efficient lookups?

  • Table x with 2n rows, indices on from_node and to_node, lookup on one column
  • Table y with n rows, indices on node_a and node_b, lookup on both columns using OR
atp
  • 30,132
  • 47
  • 125
  • 187
  • 2
    I would suspect that version Y will at least be no slower than version X -- at any rate, version Y would be the correct way to model the data. – cdhowie Sep 01 '11 at 23:46
  • 2
    On another note, the main issue I do see with version Y is that you really should, at the DB level, ensure that the node IDs (1, 2) can't be inserted alongside (2, 1). A simple unique index is not going to cut it. Version X will be *simpler* to architect at the DB level even though it's not technically normalized, but version Y might be impossible to do properly in MySQL, where you don't really have any sort of flexible constraint mechanism you can tailor to this schema. If you're not 100% sold on MySQL, you might take a look at PostgreSQL, which has support for user-defined constraints. – cdhowie Sep 01 '11 at 23:54
  • 1
    Further -- (sorry for the rambling, I really *am* trying to be helpful and not just make noise) -- if you go with version Y, the indexes should be (node_a, node_b) and (node_b) -- if you don't include node_b in the node_a index then lookups for a *specific* row (`node_a = ? AND node_b = ?`) will be suboptimal. – cdhowie Sep 01 '11 at 23:57
  • Good point, I would have to do that. I must use MySQL, but I'm not concerned with db constraints (will use app constraints). – atp Sep 01 '11 at 23:59
  • 2
    App constraints might be susceptible to concurrency problems... if two separate processes/threads go to add the (2,1) and (1,2) pairs then a race might cause both rows to be inserted. You might consider a rule like "node_a must be less than node_b" -- so your application would enforce that (1,2) and (2,1) both get inserted as (1,2). Then a simple two-column unique index would solve the problem nicely. – cdhowie Sep 02 '11 at 00:05

2 Answers2

5

This is likely to be far too out of date to be useful, but I'll post incase it helps other people!

I store undirected graphs like your second example and have a constraint that node_a has to be less than node_b. You then trivially place a UNIQUE constraint on the pair and know that the data is consistent. Queries have to a bit more work by compare node_a to the smaller of {a,b} and node_b the other value. PostgreSQL (the DB I know best) provides GREATEST() and LEAST() functions that help here.

Sam Mason
  • 15,216
  • 1
  • 41
  • 60
2

if you optimise everything, then X will be fastest, assuming that you read data from disk and are querying for friends of a single person. that's because you can arrange your data on disk so that they are ordered to match one index, which is the one you are querying. so, for a single person, you only need to do one disk seek. Y requires queries on two indices, so may imply multiple seeks to retrieve friends, even for a single person (and disk access time usually dominates simple queries).

see clustered indices at wikipedia (and the mysql manual)

if you are lucky enough to know that data will always be in memory then they will likely both be "fast enough" (and even if the data are on disk they may be fast enough - i am not saying X is the best design, only that it can be made most efficient).

andrew cooke
  • 45,717
  • 10
  • 93
  • 143
  • Yeah, this is probably right. My only concerns with this approach are that it's not really the "right way" to model the data, and MySQL doesn't support clustering on anything other than the primary key. That's not to say that your answer isn't valid, these are just my concerns with model X. – cdhowie Sep 02 '11 at 00:14
  • yes - i extended my answer as you were posting to clarify that this is purely based on efficiency... – andrew cooke Sep 02 '11 at 00:16