How to persist a graph data structure in a relational database?

Question

I've considered creating a Vertices table and an Edges table but would building graphs in memory and traversing sub-graphs require a large number of lookups? I'd like to avoid excessive database reads. Is there any other way of persisting a graph?

Side note: I've heard of Neo4j but my question is really how to conceptually represent a graph in a standard database. I am open to some NoSQL solutions like mongodb though.

In order to give you valuable advice I will need more info from your side. How many Nodes and how many relations we are talking about? — Antoan Milkov, Sep 03 '13 at 10:20
Well I'd say billions of nodes. Like I said this is mostly conceptual but I'm curious how to scale for lots of records. I have very large graphs in mind I guess. — Frank Flannigan, Sep 03 '13 at 14:23
Not open source but it is exactly what you are looking for: new Aster 6.0 comes with graph engine within relational database -it's called SQL-GR and is aimed at using existing and new functions on graphs stored in relational tables (in Aster): represented with node table and edge table. — topchef, Oct 31 '13 at 20:43
I read that diaspora does graph using RDBMS it but not sure what the hardware requirements would be to make it support a huge base (like 70% of facebook) or of that is even possible. — tgkprog, Oct 14 '14 at 16:19

score 53 · Accepted Answer · edited Jan 11 '18 at 09:52

The answer is unfortunately: Your consideration is completely right in every point. You have to store Nodes (Vertices) in one table, and Edges referencing a FromNode and a ToNode to convert a graph data structure to a relational data structure. And you are also right, that this ends up in a large number of lookups, because you are not able to partition it into subgraphs, that might be queried at once. You have to traverse from Node to Edge to Node to Edge to Node...and so on (Recursively, while SQL is working with Sets).

The point is...

Relational, Graph oriented, Object oriented, Document based are different types of data structures that meet different requirements. Thats what its all about and why so many different NoSQL Databases (most of them are simple document stores) came up, because it simply makes no sense to organize big data in a relational way.

Alternative 1 - Graph oriented database

But there are also graph oriented NoSQL databases, which make the graph data model a first class citizen like OrientDB which I am playing around with a little bit at the moment. The nice thing about it is, that although it persists data as a graph, it still can be used in a relational or even object oriented or document oriented way also (i.e. by querying with plain old SQL). Nevertheless Traversing the graph is the optimal way to get data out of it for sure.

Alternative 2 - working with graphs in memory

When it comes to fast routing, routing frameworks like Graphhopper build up the complete Graph (Billions of Nodes) inside memory. Because Graphhopper uses a MemoryMapped Implementation of its GraphStore, that even works on Android Devices with only some MB of Memory need. The complete graph is read from database into memor at startup, and routing is then done there, so you have no need to lookup the database.

+1 BTW: the really only difference between 'graph DB' and 'relational DB' is the **implementation** of the lookup. If the edge list referenced in the node table is reached through a direct pointer one can call it graph DB although the data can still be organized in tables! So, if this lookup is log(n) per edge list or even per edge then people call it a 'relational DB' and traversing the graph is rather expensive (independent of the fact if the storage is mmapped or in-memory or else). — Karussell, Nov 30 '15 at 17:11
@Karussell it's noteworthy that most SQL databases have support for hash-based indexes, with that the edge/vertex lookup is O(1), just as for a graph database. O(log(n)) query time is usually associated with B-tree based indexes, which are mostly used when sorting of data is important (which for edge/vertex IDs usually isn't relevant). — ThePhysicist, Oct 20 '16 at 10:22
You are probably right. Still a hash based index has overhead (space&time) in practise IMO compared to a direct pointer. But probably the technology used is very similar for both DBs and just the marketing blabla makes them appear very different :) — Karussell, Oct 20 '16 at 10:56
@Karussell log(n) does not imply the lookup is slow. I doubt it would come close to everything else a query needs to do. Also, it may be nice to have a reference, but what if you need to traverse the reverse relationship? — Justin Meiners, Feb 23 '22 at 00:03

score 14 · Answer 2 · answered Oct 29 '17 at 09:40

I faced this same issue and decided to finally go with the following structure, which requires 2 database queries, then the rest of the work is in memory:

Store nodes in a table and reference the graph with each node record:

Table Nodes

id  | title | graph_id
---------------------
105 | node1 | 2
106 | node2 | 2

Also store edges in another table and again reference the graph these edges belong to with each edge:

Table Edges

id | from_node_id | to_node_id | graph_id
-----------------------------------------
1  | 105          | 106        | 2
2  | 106          | 105        | 2

Get all the nodes with one query, then get all the edges with another.

Now build your preferred way to store the graph (e.g., adjacency list) and proceed with your application flow.

Stas Ivanov · Answer 3 · 2019-06-04T09:47:34.003

Adding to the previous answers the fact that MS SQL Server adds support for Graph Architecture starting with 2017.

It follows the described pattern of having Nodes and Edges tables (which should be created with special "AS NODE" and "AS EDGE" keywords).

It also has new MATCH keyword introduced "to support pattern matching and traversal through the graph" like this (friend is a name of edge table in the below example):

SELECT Person2.name AS FriendName
FROM Person Person1, friend, Person Person2
WHERE MATCH(Person1-(friend)->Person2)
AND Person1.name = 'Alice';

There is also a really good set of articles on SQL Server Graph Databases on redgate Hub.

Justin Meiners · Answer 4 · 2022-11-24T22:56:56.883

I am going to disagree with the other posts here. If you have special class of graphs with restrictions, you can often get away with a more specialized design (for example, limited number of edges per vertex, only need to traverse one way, etc).

However, for storing an arbitrary graph, relational databases are an excellent choice. They're designed with an incredibly good set of tradeoffs that perform well in almost all situations. In addition, data needs tend to change overtime, and a relational database let's you painlessly change the storage and lookup without changing the data representation.

Let's review your design:

one table for vertices (id, data)
one table for edges (startId, endId, data)

First observe that the storage is efficient as it is proportional to the data to store. If we have 10 vertices and 10 edges, we store 20 pieces of information.

Now, let's look at lookup. Assuming we have an index on vertex id, we can look up any data we want in at least log(n) (maybe better depending on index).

Given a node tell me the edges leaving it
Given a node tell me the edges entering it
Given an edge tell me the node it came from or enters

That's all the basic queries you need.

Now suppose you had a "graph database" that stores a list of edges leaving each vertex. This makes each vertex variable size. It a little easier to traverse. But, what if you want to traverse the other direction? Now you have you store a list of edges entering each vertex as well. Now you have two copies of that information, and the database (or you the developer) must do a lot of work to make sure they don't ever get out of sync.

O(log(n)) vs O(1)

Relational database indices typically store data in a sorted form, or as others have pointed out, can also use a hash table. Even if you are stuck with sorted it's going to perform very well.

First note that big oh measures scalability, not performance. Hashes, can be slower than many loops for small data sets. Even though hashing O(1) is better, binary search O(log2) is pretty darn good. You can search a billion records in 30 steps! In addition, it is cache and branch predictor friendly.

How to persist a graph data structure in a relational database?

4 Answers4

Linked