9

I have a node type with a string property that will very often have the same value, e.g. millions of nodes with only 5 possible values for that string. I will be searching by that property.

My question is which is better in terms of performance and memory: a) implement it as a node property and have lots of duplicates (and search using WHERE), or b) implement it as 5 additional nodes, where every original node references one of them (and search using an additional MATCH).
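To make the two options concrete, here is a minimal Cypher sketch. The labels `Item` and `Category`, the properties `category` and `name`, and the relationship type `HAS_CATEGORY` are hypothetical names chosen for illustration, and the syntax assumes Neo4j 2.0 or later.

```cypher
// Option a) the value is stored as a property on every node,
// so each of the millions of nodes carries its own copy of the string.
MATCH (n:Item)
WHERE n.category = 'A'
RETURN n;

// Option b) the value lives on one of 5 shared nodes,
// and every original node points at exactly one of them.
MATCH (c:Category {name: 'A'})<-[:HAS_CATEGORY]-(n:Item)
RETURN n;
```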

Martynas

3 Answers

6

Without knowing further details, it's hard to give a general-purpose answer.

From a performance perspective, it's better to limit the search as early as possible. It is even more beneficial if you do not have to look into properties during a traversal.

Given that, I assume it's better to move the lookup property into a separate node and use the value as the relationship type.
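One way to read this suggestion is sketched below: a single shared hub node, with the property value encoded in the relationship type. The labels `Item` and `CategoryHub`, the property `uid`, and the relationship type `CATEGORY_A` are hypothetical names, and the syntax assumes Neo4j 2.0 or later.

```cypher
// Attach one node to the shared hub; the relationship type carries the value.
MATCH (n:Item {uid: 42})
MATCH (hub:CategoryHub)
CREATE (n)-[:CATEGORY_A]->(hub);

// Find every node with value "A" by relationship type alone,
// without reading any property during the traversal.
MATCH (n:Item)-[:CATEGORY_A]->(:CategoryHub)
RETURN n;
```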

Stefan Armbruster
  • I really like your idea of using the value as the relationship type. Then I need just 1 otherwise useless node with no data for these new relationships to point to. Another question: will there be any performance issues with 1 million relationships to a single node? I won't be doing any searches from that node, only to it. Also, would it make sense to use the root node for it? – Martynas Mar 18 '13 at 09:53
  • As long as you traverse only to that node, there is no performance penalty. If you go in the other direction, you'll of course have to scan 1 million relationships. – Stefan Armbruster Mar 18 '13 at 12:31
  • I would not recommend using the root node. Maybe create a new node that is connected to the root node for this. Otherwise you might populate the root node with multiple different concepts, which is a bad modeling strategy in my opinion. – Stefan Armbruster Mar 18 '13 at 12:32
4

Use labels; this blog post is a good intro to this new Neo4j 2.0 feature:
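Applied to the question's case, the label approach might look like the following sketch, where each of the 5 values becomes its own label. The labels `Item` and `CategoryA` and the property `uid` are hypothetical names; labels require Neo4j 2.0 or later.

```cypher
// Tag a node with its value as an additional label.
MATCH (n:Item {uid: 42})
SET n:CategoryA;

// Find all nodes that carry that value.
MATCH (n:Item:CategoryA)
RETURN n;
```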

Eduardo Pareja Tobes
1

I've thought about this problem a little as well. In my case, I had to represent state:

  • STARTED
  • IN_PROGRESS
  • SUBMITTED
  • COMPLETED

Overall, the node + relationship approach looks more appealing: only a single relationship reference needs to be maintained per node rather than a property string, and you don't need to scan an additional index that has to be maintained on the property. Intuitively, both memory and performance favor this approach.

Another advantage is that it easily lets a node be linked to multiple "special nodes". If you foresee a situation where this should be possible in your model, this is better than having to use a property array (and searching using IN).

In practice, I found that the problem then became how to access these special nodes each time. Either you maintain some sort of constants reference holding the node IDs of the special nodes, so you can jump right to them in your START statement (this is what we do), or you search against a property of the special node each time (name, perhaps) and then traverse down its relationships. This doesn't make for the prettiest of Cypher queries.
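The difference between the two variants can be sketched as follows. The label `Item`, the array property `states`, the label `State`, and the relationship type `IN_STATE` are hypothetical names, and the syntax assumes Neo4j 2.0 or later.

```cypher
// Property-array variant: every node stores its own list of values,
// and the search is a membership test with IN.
MATCH (n:Item)
WHERE 'SUBMITTED' IN n.states
RETURN n;

// Node + relationship variant: one relationship per linked special node,
// so a node in several states simply has several IN_STATE relationships.
MATCH (n:Item)-[:IN_STATE]->(:State {name: 'SUBMITTED'})
RETURN n;
```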
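The two access patterns described above might look roughly like this. The node ID `123`, the label `State`, and the relationship type `IN_STATE` are hypothetical, and the first query uses the legacy START clause from the Neo4j versions of that era, which recent versions no longer support.

```cypher
// Variant 1 (legacy): jump straight to the special node by a stored node ID.
START s=node(123)
MATCH (s)<-[:IN_STATE]-(n)
RETURN n;

// Variant 2: look the special node up by a property, then traverse from it.
// Assumes Neo4j 2.0 or later for the label syntax.
MATCH (s:State {name: 'COMPLETED'})<-[:IN_STATE]-(n)
RETURN n;
```

With only a handful of special nodes, the property lookup in the second variant touches very few nodes, so its cost is likely small compared with the traversal that follows.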

Matt Wielbut