I am in the process of creating a graph database, a simple one for movies with several types of information like the actors, producers, directors and so on.

What I would like to know is whether it is better to break nodes down to a more granular level. For example, is it better to have two kinds of nodes, 'actor' and 'director', or one kind of node, say 'person', with different kinds of relationships such as 'acted_in' and 'directed'? Does this matter at all? Further, is there any impact on traversal queries? Does having more types of nodes make traversal slower?

Note: I intend to implement this using the Gremlin console in Amazon Neptune.

Minura Punchihewa

1 Answer

The answer really is "it depends". If I were building such a model, I would break out the key "nouns" into their own nodes. I would also label the edges appropriately, for example ACTED_IN or DIRECTED.
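To make the idea concrete, here is a minimal plain-Python sketch (not Gremlin, and not tied to Neptune) of one possible reading of that model: 'person' and 'movie' as the node labels, with ACTED_IN and DIRECTED as edge labels. All ids, names and titles are hypothetical placeholders, and the helper functions `out` and `titles` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: vertex id -> (label, properties)
vertices = {
    "p1": ("person", {"name": "Person A"}),
    "p2": ("person", {"name": "Person B"}),
    "m1": ("movie", {"title": "Movie X"}),
    "m2": ("movie", {"title": "Movie Y"}),
}

# Edges as (from_id, edge_label, to_id)
edges = [
    ("p1", "ACTED_IN", "m1"),
    ("p1", "DIRECTED", "m1"),  # the same person both acts in and directs Movie X
    ("p1", "DIRECTED", "m2"),
    ("p2", "ACTED_IN", "m2"),
]

# Index outgoing edges by (source vertex, edge label)
out_index = defaultdict(list)
for src, label, dst in edges:
    out_index[(src, label)].append(dst)


def out(vertex_id, edge_label):
    """Return the ids of vertices reached from vertex_id over edges with edge_label."""
    return out_index.get((vertex_id, edge_label), [])


def titles(vertex_ids):
    """Return the sorted 'title' property of the given movie vertices."""
    return sorted(vertices[v][1]["title"] for v in vertex_ids)


directed = titles(out("p1", "DIRECTED"))
acted = titles(out("p1", "ACTED_IN"))
print(directed)  # ['Movie X', 'Movie Y']
print(acted)     # ['Movie X']
```

In Gremlin the first lookup would look roughly like `g.V().has('person', 'name', 'Person A').out('DIRECTED').values('title')`. With a single 'person' label, someone who both acts and directs is one vertex with two kinds of outgoing edges rather than two separate vertices.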

The performance of any graph query depends on how much data it needs to touch, that is, the fan-out factor as the traversal moves from one depth to the next.
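As a rough illustration of fan-out, the following plain-Python sketch counts how many vertices a traversal reaches at each depth. The adjacency map and the function name `touched_per_depth` are hypothetical, and duplicates are deliberately not removed, which is an assumption meant to mimic a traversal without a dedup step.

```python
def touched_per_depth(adjacency, start, max_depth):
    """Count the vertices reached at each depth, starting from `start`.

    Duplicates are counted every time they are reached.
    """
    counts = []
    frontier = [start]
    for _ in range(max_depth):
        # Follow every outgoing edge of every vertex in the current frontier
        frontier = [n for v in frontier for n in adjacency.get(v, [])]
        counts.append(len(frontier))
    return counts


# Hypothetical graph: "a" has 2 neighbours, each of which has 3 neighbours
adjacency = {
    "a": ["b", "c"],
    "b": ["d", "e", "f"],
    "c": ["g", "h", "i"],
}

fan_out = touched_per_depth(adjacency, "a", 3)
print(fan_out)  # [2, 6, 0]
```

The cost in this sketch grows with the number of edges followed at each step, not with how many distinct node labels exist in the model.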

The best advice I can give you is to think about the questions you will need the graph to answer, and to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate on your data model several times; that is common and expected.

Properties can be useful when you want to add a unique piece of information to a node, for example the director's birthday.

Edge properties can be useful for filtering out unneeded edges, but so can edge labels. In some cases you may find that a label such as DIRECTED-IN-2005 is a useful shortcut that avoids checking both a label and a property on an edge.
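The two filtering styles can be compared with a small plain-Python sketch. The edge records, the `year` property name and both helper functions are hypothetical, chosen only to show that a specialised label folds the property check into the label check.

```python
# Style 1: generic label plus a "year" edge property (hypothetical data)
edges_with_property = [
    {"src": "p1", "label": "DIRECTED", "dst": "m1", "year": 2005},
    {"src": "p1", "label": "DIRECTED", "dst": "m2", "year": 2010},
]

# Style 2: the year is encoded in the edge label itself
edges_with_specialised_label = [
    {"src": "p1", "label": "DIRECTED-IN-2005", "dst": "m1"},
    {"src": "p1", "label": "DIRECTED-IN-2010", "dst": "m2"},
]


def by_label_and_property(edges, src, label, year):
    """Filter on the edge label and then on the 'year' property."""
    return [e["dst"] for e in edges
            if e["src"] == src and e["label"] == label and e["year"] == year]


def by_label(edges, src, label):
    """Filter on the edge label only."""
    return [e["dst"] for e in edges
            if e["src"] == src and e["label"] == label]


via_property = by_label_and_property(edges_with_property, "p1", "DIRECTED", 2005)
via_label = by_label(edges_with_specialised_label, "p1", "DIRECTED-IN-2005")
print(via_property)  # ['m1']
print(via_label)     # ['m1']
```

In Gremlin terms this is roughly `outE('DIRECTED').has('year', 2005).inV()` versus `out('DIRECTED-IN-2005')`. The trade-off in the second style is that each year needs its own label, so a question like "everything this person directed" has to list or match several labels.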

Kelvin Lawrence