
I'm trying to evaluate what might work best for the following use-case:

There exists a set of entities that can be represented as a graph. Each vertex in the graph represents an entity, and each (unidirectional) edge represents a child-to-parent relationship. An entity may have multiple parents, and a parent may have multiple child entities. Usually, there is a "master" entity to which all entities can trace back. No entity can be removed. The requirement is that it should be easy to trace all the ancestors of any entity. The following are some conditions on the basis of which I'd like to evaluate the options:

  1. deep trees (the highest ancestor can be far away) vs. shallow trees (the highest ancestor is usually not far away)
  2. broad traversal paths (a vertex can have many parents) vs. narrow traversal paths (a vertex usually does not have many parents)
  3. any other important conditions that I've missed

Using this graph as an example:

[Figure: graph_db_compare]

In a regular DynamoDB-like database, this would be represented as:

-------------------
entity | parents  |
-------------------
A      | []       |
-------------------
B      | [A]      |
-------------------
C      | [A]      |
-------------------
D      | [A]      |
-------------------
E      | [B, C, D]|
-------------------
F      | [C, D]   |
-------------------

A pre-existing condition is:

I'm far more familiar with DynamoDB and have only very basic familiarity with NeptuneDB or any other graph database, so DynamoDB requires less up-front time investment. On the other hand, NeptuneDB is of course better suited for relationship-graph storage, but under what conditions is it worth the technical overhead?

Lakshay Sharma
  • DynamoDB has no support built in to manage relationships. In a DynamoDB implementation, how would you efficiently find vertices with edges to D? How would you efficiently remove vertex C from your 'graph'? – jarmod Jan 04 '20 at 20:55
  • Forgot to add: in this problem space, vertices can't be removed – Lakshay Sharma Jan 04 '20 at 20:58

1 Answer


There are of course many ways to model and store connected data. As you have observed, you could store a graph using adjacency lists, as in your example. When working with highly connected data, where a graph database such as Amazon Neptune can really help is with the creation and execution of queries. For example, using the Gremlin query language (Neptune supports both TinkerPop/Gremlin and RDF/SPARQL), finding the most distant ancestor of vertex 'E' can be as simple as:

g.V('E').repeat(out()).until(__.not(out()))

No matter how deep the tree gets, the query stays the same. If you were to model the data using adjacency lists, you would have to write code to traverse the "graph" yourself. A graph database engine like Amazon Neptune is optimized to execute these types of queries efficiently.
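
To give a rough idea of what that hand-written traversal might look like, here is a minimal Python sketch. It assumes the adjacency-list table from the question has been loaded into an in-memory dict; the names `PARENTS`, `ancestors` and `most_distant_ancestors` are hypothetical and are not part of any DynamoDB or Neptune API.

```python
from collections import deque

# Hypothetical in-memory stand-in for the table in the question:
# each entity maps to the list of its direct parents.
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["A"],
    "E": ["B", "C", "D"],
    "F": ["C", "D"],
}


def ancestors(entity, parents=PARENTS):
    """Return the set of all ancestors of `entity` (breadth-first)."""
    seen = set()
    queue = deque(parents.get(entity, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            # Already reached through another parent; skip the repeat visit.
            continue
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen


def most_distant_ancestors(entity, parents=PARENTS):
    """Return the ancestors of `entity` that have no parents themselves."""
    return {a for a in ancestors(entity, parents) if not parents.get(a)}


print(sorted(ancestors("E")))
print(sorted(most_distant_ancestors("E")))
```

In a real DynamoDB setup, each `parents.get(...)` lookup would presumably become a read against the table, so the number of reads grows with the number of ancestors visited, and the traversal logic lives in your application rather than in the database.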

So in summary, you could do it using DynamoDB or using Neptune, but if the graph becomes complex, a graph database with a built-in set of graph querying capabilities should make it a lot easier to write queries that traverse the graph. The decision will come down to, as you note, the trade-off between reusing what you already know well and learning something new to gain the ability to easily write and execute queries no matter how complex the connected data becomes. I hope this helps you make that decision.

You will find a simple example of using Gremlin to model and traverse a tree here:

http://www.kelvinlawrence.net/book/PracticalGremlin.html#btree

Kelvin Lawrence