I just learned about graph databases as opposed to relational databases (RDBMS). I went through some resources on the Neo4j website and read some chapters of the O'Reilly book Graph Databases. However, I can't get my head around how a graph database actually stores its data.

If I had to store a graph in an RDBMS, I would create different lists for the vertices and the edges. How is a graph database different? I really struggle to picture how, e.g., Neo4j stores and links its data (the nodes and relationships) differently from a conventional RDBMS.

If anyone could help me understand and visualize how a graph database works internally, I'd be really thankful. If you don't quite understand my question, I'm happy to explain it more specifically.

MST
  • I don't know neo4j but I have worked with software images and graphics. Graphs and line work are stored as vectors. Think back to basic algebra and geometry when we drew shapes on graph paper. What did you know? The points on the graph. That's what's stored. The points of the vectors making up the graph. – LAS Feb 14 '18 at 00:35

1 Answer

The answer is in chapter 6 of O'Reilly's Graph Databases book, "Graph Database Internals". That chapter describes how Neo4j works internally, including how its native graph storage works.

Neo4j stores nodes, relationships, labels and properties in separate files.

Nodes are stored in the file neostore.nodestore.db. This file is made up of fixed-size records: in the version described in the book, each node added to the database grows the file by 9 bytes. Because every record has the same size, the node with id 100 can be found directly at byte offset 900 in the file (id 100 x 9 bytes per node = 900), without any searching. Each node record holds pointers to the node's first relationship, its first property, and its labels.
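The offset arithmetic above can be sketched in a few lines of Python. This is only an illustration under assumptions: the 9-byte layout used here (a 1-byte in-use flag, a 4-byte first-relationship id and a 4-byte first-property id), the `NO_ID` marker and the helper names are made up for the example and are not claimed to match Neo4j's real on-disk format.

```python
import io
import struct

# Assumed 9-byte layout, for illustration only:
# in-use flag (1 byte), first relationship id (4 bytes), first property id (4 bytes).
NODE_RECORD = struct.Struct(">BII")
assert NODE_RECORD.size == 9

NO_ID = 0xFFFFFFFF  # hypothetical "no pointer" marker


def write_node(store, node_id, first_rel, first_prop):
    """Write a node record at the offset derived from its id."""
    store.seek(node_id * NODE_RECORD.size)
    store.write(NODE_RECORD.pack(1, first_rel, first_prop))


def read_node(store, node_id):
    """Jump straight to the record: offset = id * record size, no search needed."""
    offset = node_id * NODE_RECORD.size
    store.seek(offset)
    in_use, first_rel, first_prop = NODE_RECORD.unpack(store.read(NODE_RECORD.size))
    return offset, bool(in_use), first_rel, first_prop


# An in-memory stand-in for the node store file, with room for ids 0..100.
store = io.BytesIO(bytes(NODE_RECORD.size * 101))
write_node(store, 100, first_rel=7, first_prop=NO_ID)
print(read_node(store, 100))  # (900, True, 7, 4294967295)
```

The point of the sketch is that looking up a node by id costs one multiplication and one seek, regardless of how many nodes the store holds.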

Relationships are stored in the file neostore.relationshipstore.db. This file also consists of fixed-size records. Each relationship record has pointers to its start and end nodes, to its relationship type (stored in the neostore.relationshiptypestore.db file), and to the next and previous relationship records for each of the start and end nodes, plus a flag indicating whether the relationship is the first in its relationship chain.
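To make the relationship chain more concrete, here is a minimal in-memory sketch. It is a simplification under assumptions: it keeps only the "next" pointers (the real records also have "previous" pointers), it uses Python dictionaries in place of the store files, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass
class Node:
    first_rel: Optional[int] = None  # id of the first relationship in this node's chain


@dataclass
class Rel:
    start: int
    end: int
    rel_type: str
    start_next: Optional[int] = None  # next relationship in the start node's chain
    end_next: Optional[int] = None    # next relationship in the end node's chain


nodes: Dict[int, Node] = {}
rels: Dict[int, Rel] = {}


def add_rel(rel_id: int, start: int, end: int, rel_type: str) -> None:
    """Insert the new relationship at the head of both nodes' chains."""
    a = nodes.setdefault(start, Node())
    b = nodes.setdefault(end, Node())
    rels[rel_id] = Rel(start, end, rel_type,
                       start_next=a.first_rel, end_next=b.first_rel)
    a.first_rel = rel_id
    b.first_rel = rel_id


def relationships_of(node_id: int) -> Iterator[int]:
    """Follow the chain for one node; unrelated relationships are never touched."""
    rel_id = nodes[node_id].first_rel
    while rel_id is not None:
        yield rel_id
        rel = rels[rel_id]
        # Pick the pointer that belongs to the chain of the node we are walking.
        rel_id = rel.start_next if rel.start == node_id else rel.end_next


add_rel(0, 1, 2, "KNOWS")
add_rel(1, 3, 4, "KNOWS")
add_rel(2, 1, 3, "LIKES")
add_rel(3, 2, 4, "KNOWS")
print(list(relationships_of(1)))  # [2, 0]
```

Walking the chain for node 1 visits only relationships 2 and 0; relationships 1 and 3, which do not involve node 1, are never read.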

Properties of nodes and relationships are stored in the file neostore.propertystore.db. Each property record has a pointer to the next property record and can hold a maximum of 4 properties. Each property holds its property type and a pointer to the property name (stored in the neostore.propertystore.db.index file). The property value can be an inline value or a pointer to a dynamic store file for large strings (neostore.propertystore.db.strings) and arrays (neostore.propertystore.db.arrays).
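The property chain can be sketched in the same spirit. Again this is an assumption-laden illustration, not Neo4j's format: the inline cut-off `INLINE_LIMIT`, the list-based name and string stores, and the function names are all invented for the example; only the "at most 4 properties per record, linked to the next record" idea comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

BLOCKS_PER_RECORD = 4  # maximum properties per record, as described above
INLINE_LIMIT = 8       # hypothetical cut-off for inline strings, chosen for this sketch


@dataclass
class PropertyRecord:
    # Each block is (property name id, kind, payload).
    blocks: List[Tuple[int, str, object]] = field(default_factory=list)
    next_record: Optional[int] = None


names: List[str] = []    # stand-in for the property name store
strings: List[str] = []  # stand-in for the dynamic string store
records: Dict[int, PropertyRecord] = {}


def name_id(name: str) -> int:
    if name not in names:
        names.append(name)
    return names.index(name)


def store_properties(props: dict) -> Optional[int]:
    """Pack properties into chained records; return the id of the first record."""
    first = prev = None
    items = list(props.items())
    for i in range(0, len(items), BLOCKS_PER_RECORD):
        rec_id = len(records)
        rec = PropertyRecord()
        for name, value in items[i:i + BLOCKS_PER_RECORD]:
            if isinstance(value, str) and len(value) > INLINE_LIMIT:
                strings.append(value)  # large value goes to the string store
                rec.blocks.append((name_id(name), "string_ref", len(strings) - 1))
            else:
                rec.blocks.append((name_id(name), "inline", value))
        records[rec_id] = rec
        if prev is None:
            first = rec_id
        else:
            records[prev].next_record = rec_id
        prev = rec_id
    return first


def load_properties(first: Optional[int]) -> dict:
    """Follow next_record pointers and resolve names and out-of-line strings."""
    out = {}
    rec_id = first
    while rec_id is not None:
        rec = records[rec_id]
        for nid, kind, payload in rec.blocks:
            out[names[nid]] = strings[payload] if kind == "string_ref" else payload
        rec_id = rec.next_record
    return out


props = {"name": "Alice", "age": 30, "city": "Oslo", "active": True,
         "bio": "A fairly long biography string"}
first = store_properties(props)
print(len(records), load_properties(first) == props)  # 2 True
```

Five properties need two chained records here, and only the long `bio` string is stored out of line.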

Bruno Peres
  • 3
    Thank you very much, but I am still confused about how this is so much more efficient to look up. When you want to know all the relationships of a particular node, don't you still have to search through the whole list of relationships to find them all? – MST Feb 21 '18 at 20:51
  • 1
    @Maclaren In part. For example, in this query: `match (n:Node)-[r]->() return type(r)` Neo4j will select `n` as the start point for walking over the graph. To get this start point, Neo4j will use the index implicitly created for the `:Node` label. Then, to expand the `r` relationships, Neo4j will iterate only over the relationships connected to the `n` start point and **not over all relationships in the whole graph**. – Bruno Peres Feb 23 '18 at 11:58
  • 2
    @Maclaren Also, Neo4j should be seriously considered when handling highly connected data because of its [index-free adjacency and native graph processing engine](https://neo4j.com/blog/native-vs-non-native-graph-technology/). Neo4j has no `JOINS`. To traverse the graph, Neo4j follows the relationships connected to each node at a very low cost, instead of doing `JOINS` and computing Cartesian products as traditional relational databases do. – Bruno Peres Feb 23 '18 at 12:08
  • Basically it's a trade-off. Performing lots of key lookups in a relational database can become quite expensive. (Lots being tens of thousands.) A graph database makes key lookups as cheap as following a pointer, at the expense of making index scans over large amounts of data really, really expensive. – Jonathan Allen Oct 04 '21 at 02:57