
Suppose you're Twitter, and:

  • You have (:User) and (:Tweet) nodes;
  • Tweets can get flagged; and
  • You want to query the list of flagged tweets currently awaiting moderation.

You can either add a label for those tweets, e.g. :AwaitingModeration, or add and index a property, e.g. isAwaitingModeration = true|false.

Is one option inherently better than the other?

I know the best answer is probably to try and load test both :), but is there anything from Neo4j's implementation POV that makes one option more robust or suited for this kind of query?

Does it depend on the volume of tweets in this state at any given moment? If it's in the 10s vs. the 1000s, does that make a difference?

My impression is that labels are better suited for a large volume of nodes, whereas indexed properties are better for smaller volumes (ideally, unique nodes), but I'm not sure if that's actually true.

Thanks!

Aseem Kishore
  • I don't really know, but I would think that the label would be more efficient. If you use the label, you can exclude all of the other `(:Tweet)` nodes by not even matching on them. If you use the property method on the `(:Tweet)` node, your match will still include the `Tweet` label. In the relational or directory worlds, I don't think you would index the property value, as it would have low selectivity. I would be interested to see the answers though. – Dave Bennett Jan 15 '15 at 04:02

1 Answer


UPDATE: Follow up blog post published.

This is a common question when we model datasets for customers and a typical use case for Active/NonActive entities.

Here is some feedback from what I've experienced, valid for Neo4j 2.1.6:

Point 1. There is no difference in db accesses between matching on a label and matching on an indexed property when you simply return the nodes.
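As a rough sketch of the two global lookups being compared, using the names from the question (`:Tweet`, `:AwaitingModeration`, `isAwaitingModeration`) and the Neo4j 2.x index syntax:

```cypher
// Option A: a dedicated label; the lookup scans only the nodes carrying it.
MATCH (t:AwaitingModeration)
RETURN t;

// Option B: an indexed boolean property on :Tweet (2.x index syntax).
CREATE INDEX ON :Tweet(isAwaitingModeration);

MATCH (t:Tweet)
WHERE t.isAwaitingModeration = true
RETURN t;
```

Running each lookup with PROFILE on your own data is the way to confirm that the db hits match for your version.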

Point 2. The difference appears when such nodes are at the end of a pattern, for example:

MATCH (n:User {id:1})
WITH n
MATCH (n)-[:WRITTEN]->(post:Post)
WHERE post.published = true
RETURN n, collect(post) as posts;

-

PROFILE MATCH (n:User) WHERE n._id = 'c084e0ca-22b6-35f8-a786-c07891f108fc'
> WITH n
> MATCH (n)-[:WRITTEN]->(post:BlogPost)
> WHERE post.active = true
> RETURN n, size(collect(post)) as posts;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| n                                                                                                                                                         | posts |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Node[118]{_id:"c084e0ca-22b6-35f8-a786-c07891f108fc",login:"joy.wiza",password:"7425b990a544ae26ea764a4473c1863253240128",email:"hayes.shaina@yahoo.com"} | 1     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row

ColumnFilter(0)
  |
  +Extract
    |
    +ColumnFilter(1)
      |
      +EagerAggregation
        |
        +Filter
          |
          +SimplePatternMatcher
            |
            +SchemaIndex

+----------------------+------+--------+----------------------+----------------------------------------------------------------------------+
|             Operator | Rows | DbHits |          Identifiers |                                                                      Other |
+----------------------+------+--------+----------------------+----------------------------------------------------------------------------+
|      ColumnFilter(0) |    1 |      0 |                      |                                                      keep columns n, posts |
|              Extract |    1 |      0 |                      |                                                                      posts |
|      ColumnFilter(1) |    1 |      0 |                      |                                           keep columns n,   AGGREGATION153 |
|     EagerAggregation |    1 |      0 |                      |                                                                          n |
|               Filter |    1 |      3 |                      | (hasLabel(post:BlogPost(1)) AND Property(post,active(8)) == {  AUTOBOOL1}) |
| SimplePatternMatcher |    1 |     12 | n, post,   UNNAMED84 |                                                                            |
|          SchemaIndex |    1 |      2 |                 n, n |                                                {  AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+----------------------------------------------------------------------------+

Total database accesses: 17

In this case, Cypher will not make use of the index :Post(published).

Thus using a label is more performant here, for example with a dedicated ActivePost label:

neo4j-sh (?)$ PROFILE MATCH (n:User) WHERE n._id = 'c084e0ca-22b6-35f8-a786-c07891f108fc'
> WITH n
> MATCH (n)-[:WRITTEN]->(post:ActivePost)
> RETURN n, size(collect(post)) as posts;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| n                                                                                                                                                         | posts |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Node[118]{_id:"c084e0ca-22b6-35f8-a786-c07891f108fc",login:"joy.wiza",password:"7425b990a544ae26ea764a4473c1863253240128",email:"hayes.shaina@yahoo.com"} | 1     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row

ColumnFilter(0)
  |
  +Extract
    |
    +ColumnFilter(1)
      |
      +EagerAggregation
        |
        +Filter
          |
          +SimplePatternMatcher
            |
            +SchemaIndex

+----------------------+------+--------+----------------------+----------------------------------+
|             Operator | Rows | DbHits |          Identifiers |                            Other |
+----------------------+------+--------+----------------------+----------------------------------+
|      ColumnFilter(0) |    1 |      0 |                      |            keep columns n, posts |
|              Extract |    1 |      0 |                      |                            posts |
|      ColumnFilter(1) |    1 |      0 |                      | keep columns n,   AGGREGATION130 |
|     EagerAggregation |    1 |      0 |                      |                                n |
|               Filter |    1 |      1 |                      |     hasLabel(post:ActivePost(2)) |
| SimplePatternMatcher |    1 |      4 | n, post,   UNNAMED84 |                                  |
|          SchemaIndex |    1 |      2 |                 n, n |      {  AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+----------------------------------+

Total database accesses: 7

Point 3. Always use labels for positives. In the case above, having a Draft label instead would force you to execute the following query:

MATCH (n:User {id:1})
WITH n
MATCH (n)-[:POST]->(post:Post)
WHERE NOT post :Draft
RETURN n, collect(post) as posts;

This means that Cypher has to open each node's label header and filter on it.
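One way to keep the filter positive could look like the sketch below. The `Published` label and the `id: 42` post identifier are assumptions for illustration, not part of the dataset profiled above:

```cypher
// Hypothetical publish step: swap the negative marker for a positive one.
// `id: 42` is a placeholder post identifier.
MATCH (post:Post {id: 42})
REMOVE post:Draft
SET post:Published;

// The read query then matches the positive label directly, with no NOT filter.
MATCH (n:User {id: 1})-[:POST]->(post:Published)
RETURN n, collect(post) AS posts;
```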

Point 4. Avoid needing to match on multiple labels:

MATCH (n:User {id:1})
WITH n
MATCH (n)-[:POST]->(post:Post:ActivePost)
RETURN n, collect(post) as posts;

neo4j-sh (?)$ PROFILE MATCH (n:User) WHERE n._id = 'c084e0ca-22b6-35f8-a786-c07891f108fc'
> WITH n
> MATCH (n)-[:WRITTEN]->(post:BlogPost:ActivePost)
> RETURN n, size(collect(post)) as posts;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| n                                                                                                                                                         | posts |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Node[118]{_id:"c084e0ca-22b6-35f8-a786-c07891f108fc",login:"joy.wiza",password:"7425b990a544ae26ea764a4473c1863253240128",email:"hayes.shaina@yahoo.com"} | 1     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row

ColumnFilter(0)
  |
  +Extract
    |
    +ColumnFilter(1)
      |
      +EagerAggregation
        |
        +Filter
          |
          +SimplePatternMatcher
            |
            +SchemaIndex

+----------------------+------+--------+----------------------+---------------------------------------------------------------+
|             Operator | Rows | DbHits |          Identifiers |                                                         Other |
+----------------------+------+--------+----------------------+---------------------------------------------------------------+
|      ColumnFilter(0) |    1 |      0 |                      |                                         keep columns n, posts |
|              Extract |    1 |      0 |                      |                                                         posts |
|      ColumnFilter(1) |    1 |      0 |                      |                              keep columns n,   AGGREGATION139 |
|     EagerAggregation |    1 |      0 |                      |                                                             n |
|               Filter |    1 |      2 |                      | (hasLabel(post:BlogPost(1)) AND hasLabel(post:ActivePost(2))) |
| SimplePatternMatcher |    1 |      8 | n, post,   UNNAMED84 |                                                               |
|          SchemaIndex |    1 |      2 |                 n, n |                                   {  AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+---------------------------------------------------------------+

Total database accesses: 12

This results in the same process for Cypher as in Point 3.

Point 5. If possible, avoid the need to match on labels by using well-named relationship types:

MATCH (n:User {id:1})
WITH n
MATCH (n)-[:PUBLISHED]->(p)
RETURN n, collect(p) as posts

-

MATCH (n:User {id:1})
WITH n
MATCH (n)-[:DRAFTED]->(post)
RETURN n, collect(post) as posts;

neo4j-sh (?)$ PROFILE MATCH (n:User) WHERE n._id = 'c084e0ca-22b6-35f8-a786-c07891f108fc'
> WITH n
> MATCH (n)-[:DRAFTED]->(post)
> RETURN n, size(collect(post)) as posts;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| n                                                                                                                                                         | posts |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Node[118]{_id:"c084e0ca-22b6-35f8-a786-c07891f108fc",login:"joy.wiza",password:"7425b990a544ae26ea764a4473c1863253240128",email:"hayes.shaina@yahoo.com"} | 3     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row

ColumnFilter(0)
  |
  +Extract
    |
    +ColumnFilter(1)
      |
      +EagerAggregation
        |
        +SimplePatternMatcher
          |
          +SchemaIndex

+----------------------+------+--------+----------------------+----------------------------------+
|             Operator | Rows | DbHits |          Identifiers |                            Other |
+----------------------+------+--------+----------------------+----------------------------------+
|      ColumnFilter(0) |    1 |      0 |                      |            keep columns n, posts |
|              Extract |    1 |      0 |                      |                            posts |
|      ColumnFilter(1) |    1 |      0 |                      | keep columns n,   AGGREGATION119 |
|     EagerAggregation |    1 |      0 |                      |                                n |
| SimplePatternMatcher |    3 |      0 | n, post,   UNNAMED84 |                                  |
|          SchemaIndex |    1 |      2 |                 n, n |      {  AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+----------------------------------+

Total database accesses: 2

This will be more performant, because it uses the full power of the graph and just follows the relationships from the node, resulting in no more db accesses than matching the user node, and thus no filtering on labels.
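With this model, a state change becomes a relationship swap rather than a label or property update. A minimal sketch, assuming a hypothetical `id` property on the post (the `id: 42` value is a placeholder):

```cypher
// Hypothetical publish step: replace the DRAFTED relationship with a PUBLISHED one.
MATCH (n:User {id: 1})-[r:DRAFTED]->(post {id: 42})
CREATE (n)-[:PUBLISHED]->(post)
DELETE r;
```

The trade-off is that this only helps queries that start from the user. A global lookup of all drafts still needs a label or an index.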

This was my 0,02€

Michal Bachman
Christophe Willemsen
  • Excellent answer, and comprehensive. I learned a lot, and I like learning stuff. Seems to me some principles of good Neo4j modeling strategy are still evolving. It would be good if the community could gather more of these modeling principles in the documentation, since many new users are graph neophytes. – FrobberOfBits Jan 15 '15 at 18:57
  • I'm honored to get such a comment by you. Thanks ;-) – Christophe Willemsen Jan 15 '15 at 19:03
  • Agreed, thank you for the thorough answer. I have some follow-up questions; too bad this tiny comment box is the only place for it. Point 2: I don't believe labels make *traversals* any faster either. Only relationship type matters then, right? Point 4: Why would specifying more labels be slower? Isn't Cypher smart enough to use the one with the lower cardinality first? In general, it might be nice to stick to the example in the original q: *just* a global lookup, *not* a traversal from e.g. a user node. So I think my takeaway for that scenario is: both options are equivalent? – Aseem Kishore Jan 15 '15 at 22:20
  • For point 2: the problem is that the indexed property will not be used, so if in your case you use only one label for everything, Cypher will run the filter on all the tweets. If you use a dedicated label, the label itself acts as a built-in filter. For point 4: Cypher will match on one label and perform another filter, hasLabel(), for the other label. I will edit the answer with results from the execution plan ;-) – Christophe Willemsen Jan 15 '15 at 22:31
  • I have added results of the PROFILE with a mini dataset, but it shows you the reality in terms of performance – Christophe Willemsen Jan 15 '15 at 22:50
  • Otherwise, if you just need to match the AwaitingModeration nodes and nothing more, YES labels and indexed properties will work the same – Christophe Willemsen Jan 15 '15 at 23:16
  • You three can also communicate in the neo4j-ecosystem google group :) – Michael Hunger Jan 15 '15 at 23:17
  • ha true :) never thought of it :) – Christophe Willemsen Jan 15 '15 at 23:19
  • I am curious though, @MichaelHunger and team: how much of the current discrepancy in performance is temporary? It seems like there's a lot of room for Cypher to be smarter / more optimized in these cases. Should we expect that in future Neo4j versions? – Aseem Kishore Jan 30 '15 at 17:45
  • Interesting. I tried to reproduce the last example query, compared it to a query using labels, and came to the conclusion that labels use fewer DbHits. Actually, I am wondering why your last query doesn't trigger any NodeByLabelScan, as you specify a label for n:User. Also, in my case `MATCH (n)-[:DRAFTED]->(post)` leads to `Expand(All)`, which actually does hit the db. Have you done some special magic to avoid this, or did Neo4j change its operation procedure? – F Lekschas Sep 09 '15 at 18:22
  • Yes, this post reflects how the queries ran in Neo4j 2.1.6 at that time. Cypher, with the new COST planner introduced in Neo4j 2.2.x, has become smarter about these things. – Christophe Willemsen Sep 09 '15 at 18:24