0

I need to connect a node of Type "Application" with users created on the systems the application is installed on.

Normally a "Application" is installed on cluster pair (2 systems) and has only the default system as property "n.System".

A "Application" has a naming schema: <prefix><number of 5 digits>, e.g. yxz12345 or ab23456, etc.

On each system are user accounts created (sometimes up to 100 on each system). Some of them have a naming schema like the "Application": <prefix><number of 5 digits>, e.g. sdjhg12345 or tzrw23456, etc. some are not.

An "Application" as a "User" property that can contain the user its running on u.Name = n.User OR it uses all "Users" that have the same 5 digits after its prefix right(u.Name, 5) = right(n.Name, 5).

Usernames are shared across all systems, so we only need to link the users that are on the same systems.

I'm using following query to create the relationship:

MATCH (n:Application {Id: 1})
WITH n
MATCH (s:System)-[:ClusteredWith]-(c:System)
WHERE s.Name = n.System
WITH n, s, c
MATCH (u:User)
WHERE 
    ((u)-[:CreatedOn]->(s) OR (u)-[:CreatedOn]->(c))
AND
    (u.Name = n.User OR right(u.Name, 5) = right(n.Name, 5))

CREATE UNIQUE (u)-[:UsedFor]->(n)

There're 8000 Systems, 100000 Users and 30000 Applications in the neo4j database currently.

I've auto property index on Id, Name, User

This query is extremly slow on a very powerful hardware (Up to 96 GB RAM, etc). I'm using the Neo4jClient version 1.0.0.646 and Neo4j 2.0.1

How to get this query fast?

EDIT: Query Plan Added:

==> EmptyResult(_rows=0, _db_hits=0)
==> UpdateGraph(commands=[{"action": "CreateUnique", "identifiers": ["u", "n", "  UNNAMED305"]}], _rows=0, _db_hits=0)
==>   Eager(_rows=0, _db_hits=0)
==>     Filter(pred="((nonEmpty(PathExpression((u)-[  UNNAMED165:CreatedOn]->(s), true)) OR nonEmpty(PathExpression((u)-[  UNNAMED196:CreatedOn]->(c), true))) AND (Property(u,Name(0)) == Property(n,User(33)) OR RightFunction(Property(u,Name(0)),Literal(5)) == RightFunction(Property(n,Name(0)),Literal(5))))", _rows=0, _db_hits=29774466)
==>       NodeByLabel(identifier="u", _db_hits=0, _rows=4962411, label="User", identifiers=["u"], producer="NodeByLabel")
==>         ColumnFilter(symKeys=["n", "c", "s", "  UNNAMED58"], returnItemNames=["n", "s", "c"], _rows=183, _db_hits=0)
==>           Filter(pred="(Property(s,Name(0)) == Property(n,System(36)) AND hasLabel(s:System(0)))", _rows=183, _db_hits=366)
==>             SimplePatternMatcher(g="(c)-['  UNNAMED58']-(s)", _rows=183, _db_hits=4880)
==>               NodeByLabel(identifier="c", _db_hits=0, _rows=2915, label="System", identifiers=["c"], producer="NodeByLabel")
==>                 Filter(pred="Property(n,Id(0)) == Literal(1)", _rows=1, _db_hits=702)
==>                   NodeByLabel(identifier="n", _db_hits=0, _rows=702, label="Application", identifiers=["n"], producer="NodeByLabel")

This is an query for an application on 2 systems but without a matching user (currently)

dna
  • 1,498
  • 1
  • 14
  • 35
  • 1
    Bulk update always tend to be a bit slow. You might want to give the PERIODIC COMMIT statement a try. It is released recently in 2.1.0.M1. – tstorms Mar 07 '14 at 08:39
  • Only the initial rebuild of the database is a "bulk" import and i'll give it a try .. but during normal operations (new applications are not so often) it is slow and sometimes times out – dna Mar 07 '14 at 08:45

1 Answers1

1

Profile your query (how?) and review the execution plan with the following in mind:

  1. Is :Application(Id) unique?
  2. How many database hits are needed for the comparison :Application(System) = :System(Name)? Could they be avoided entirely by relating (:Application)-[:InstalledOn]->(:System)?
  3. Do you have duplicate paths/result items bound at WITH n, s, c?
    • ~ The pattern (s:System)-[:ClusteredWith]-(c:System) is symmetric
    • ~ Duplicates mean the rest of the query is executed multiple times
  4. How many users are matched? Is the index being used? It may not be since the WHERE clause is complex, possibly too complex for the engine to refactor to an exact lookup. Can you refactor it to allow an initial index lookup, or even better, reach the users via (:System)<-[:CreatedOn]-(:User) to avoid handling all the irrelevant users?

My guess is the big bottle neck is the user lookup. If you also have more than one result item at that point in the query, the problem is compounded. The following is guesswork, but maybe you can adapt it and get it to work.

MATCH (app:Application {Id: 1})
MATCH (:System {Name:app.System})-[:ClusteredWith*0..1]-(sys:System)<-[:CreatedOn]-(user)
WHERE  (user.Name = app.User OR right(user.Name, 5) = right(app.Name, 5))
CREATE UNIQUE (user)-[:UsedFor]->(app)
Community
  • 1
  • 1
jjaderberg
  • 9,844
  • 34
  • 34
  • Hi, thanks for your helpful answer. I added the current query plan and i'll try out your suggestions next week. – dna Mar 07 '14 at 11:54
  • 1
    You might post your index configuration as well (`:schema` in neo4j-browser, `schema` in neo4j-shell), from the execution plan it doesn't look like any indices are being used. – jjaderberg Mar 07 '14 at 12:01
  • You're right: ==> No indexes ==> ==> No constraints Why, when autoindex is enabled? – dna Mar 07 '14 at 12:03
  • 1
    You are probably using legacy indexing as opposed to the new label based schema indexing (which is also "automatic"). Take a look at [this question](http://stackoverflow.com/questions/22049121/does-label-mechanism-provide-auto-indexing-features-when-using-neo4j-java-api) and [this manual page](http://docs.neo4j.org/chunked/stable/cypher-schema.html) for info on schema indexing. – jjaderberg Mar 07 '14 at 12:28
  • Your suggestions was helpful and it runs now very fast. I added Neo4j 2.0 indexes and rewrote the query to use relations like your example. – dna Mar 10 '14 at 16:10