
I have a program that opens an embedded database and runs several queries on it, reusing a single ExecutionEngine for each query. Just running the first three queries, which are the simplest, takes - well, I don't know how long, because I stopped it after about half an hour, at which point it had finished only two of them. I have had issues with Cypher being slow on this graph before, but it has never been this bad.

I am using the core API for some more complicated queries, but I'd rather use Cypher for these because they are so simple. I also have some other queries I would like to run that essentially need to walk through and return most of the database, some nodes multiple times. I know this is not recommended, but I need everything laid out according to its relationships; simply getting every node in the graph would be useless. At the rate I'm going, that query would take a few days. I have no problem with what other people consider "slow" (e.g. 500 ms), because this is not a real-time application, but 20 minutes per query is excessive. What's going wrong? What am I doing wrong?

My database contains several million nodes and at least as many relationships. Neo4j is supposed to be able to handle graphs that large easily. Why am I getting such crazily long execution times?

If anyone can help me with this (maybe my queries are all wrong?), I'd really appreciate it!

Thanks, bsg

Here is the code for the first three queries, which together take 30+ minutes. It runs each one and prints the result (a simple count) to a file.

    // One ExecutionEngine, reused for every Cypher query (Neo4j 1.9).
    ExecutionEngine eng = new ExecutionEngine(graphdb);

    String filepath = resultstring + "basicstats.txt";
    PrintWriter basics = new PrintWriter(filepath);

    // Users that were fully crawled (have a FullNodeCreationTime property).
    String querystring = "START user=node:userIndex(\"Username:*\")"
            + " WHERE has(user.FullNodeCreationTime)"
            + " RETURN COUNT(user) AS numcrawled";

    ExecutionResult result = eng.execute(querystring);

    basics.print("Number of users crawled: ");
    basics.println(result.iterator().next().get("numcrawled"));

    // Users that were only touched, not crawled.
    String otherusers = "START user=node:userIndex(\"Username:*\")"
            + " WHERE NOT has(user.FullNodeCreationTime)"
            + " RETURN COUNT(user) AS numtouched";

    result = eng.execute(otherusers);
    basics.print("Number of users touched (not crawled): ");
    basics.println(result.iterator().next().get("numtouched"));

    // Users with only partial information.
    String partialinfousers = "START user=node:userIndex(\"Username:*\")"
            + " WHERE NOT has(user.FullNodeCreationTime) AND NOT has(user.NumFollowers)"
            + " RETURN COUNT(user.Username) AS numcrawled";

    result = eng.execute(partialinfousers);

    basics.print("Number of users with partial info: ");
    basics.println(result.iterator().next().get("numcrawled"));

    basics.close();
bsg
  • Are you sure the logging isn't the bottleneck? What are the query times when you remove the PrintWriter? – tstorms Jan 14 '14 at 09:33
  • Each query should return a single value, so the file is written to just 3 times. I can't imagine that that's such a huge bottleneck. – bsg Jan 14 '14 at 15:56
  • What version of Neo4j are you using? You might want to consider using 2.0, so you can use labels. In your example, you could create a USER label, thus eliminating the use of a Lucene query. This will improve your query performance. – tstorms Jan 14 '14 at 16:02
  • I am using 1.9, but I can't really make huge modifications to the database right now. – bsg Jan 14 '14 at 20:01

2 Answers


How big is your database? How many users do you have in your userIndex?

What is your memory/heap configuration? I assume you are running into a lot of GC issues, as Cypher tries to fit the whole database into memory for your queries.

Also, with cold caches and little memory, you are basically measuring disk speed as the data is pulled into memory.
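
As a rough sketch of what that tuning could look like, here is how the memory-mapped buffers might be set when opening the embedded store; this assumes the Neo4j 1.9 GraphDatabaseFactory builder, and the store path and sizes are placeholders to adjust for your hardware:

    // Sketch only: open the embedded store with explicit memory-mapping settings.
    // The keys are the standard neo4j.properties names; the values are illustrative.
    GraphDatabaseService graphdb = new GraphDatabaseFactory()
            .newEmbeddedDatabaseBuilder("/path/to/graph.db")   // placeholder path
            .setConfig("neostore.nodestore.db.mapped_memory", "256M")
            .setConfig("neostore.relationshipstore.db.mapped_memory", "512M")
            .setConfig("neostore.propertystore.db.mapped_memory", "256M")
            .newGraphDatabase();

    // The JVM heap is set separately (e.g. -Xmx on the command line) and should
    // leave headroom for the mapped buffers above, which live off-heap.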

You can combine your queries into one.

    START user=node:userIndex("Username:*")
    RETURN has(user.FullNodeCreationTime), has(user.NumFollowers), COUNT(*) AS num

That should return 4 rows, one for each of the 4 combinations, which you can easily consume and aggregate.
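
And a small sketch of consuming that single combined query from the Java side, reusing the eng and basics objects from the question; the crawled / hasFollowers aliases are additions of mine to make the column lookups readable (ExecutionResult is iterable as rows of Map<String, Object> in 1.9, so java.util.Map is needed):

    // Sketch: run the combined query once and print each grouping row.
    ExecutionResult combined = eng.execute(
            "START user=node:userIndex(\"Username:*\")" +
            " RETURN has(user.FullNodeCreationTime) AS crawled," +
            " has(user.NumFollowers) AS hasFollowers, COUNT(*) AS num");

    for (Map<String, Object> row : combined) {
        basics.println("crawled=" + row.get("crawled")
                + " hasFollowers=" + row.get("hasFollowers")
                + " count=" + row.get("num"));
    }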

None of these are really graph queries, and they are also graph-global queries, so neither Neo4j nor Cypher is optimized for them :)

Michael Hunger
  • That's smart. Thanks. I know that they're not exactly graph queries (my graph queries perform even worse, because they're actually touching the whole graph), but I thought using an index was supposed to speed things up... And there are about 500K users in the index, but only about 5k of them meet the criteria. – bsg Jan 15 '14 at 13:57
  • And I'm running with -Xmx1024M - my system won't let me use more than that. – bsg Jan 15 '14 at 13:58

First, run your query from the shell using EXPLAIN to see how the query is being executed. That's always the first place to start when investigating performance issues.

Second, if I'm understanding your first query correctly, you simply want to know how many nodes have the property FullNodeCreationTime. Your existing query isn't really using the index in an optimal way, since you aren't looking for a specific value. It also appears that you are looking at a single node type, meaning a node with a specific label such as User. If that is correct, then I would create an index on User.FullNodeCreationTime and simply run this query:

    match (u:User) where has(u.FullNodeCreationTime) return count(u)

That should perform much better.
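
For completeness, a sketch of what that would look like from the embedded Java side, assuming the database were upgraded to 2.0, the user nodes carried a User label, and the ExecutionEngine from the question were reused (per the comments, staying on 1.9 rules this out for now):

    // Sketch: the same count via a label scan instead of the Lucene index.
    // Assumes Neo4j 2.0+, where reads also need a transaction.
    try (Transaction tx = graphdb.beginTx()) {
        ExecutionResult labeled = eng.execute(
                "MATCH (u:User) WHERE has(u.FullNodeCreationTime)" +
                " RETURN count(u) AS numcrawled");
        System.out.println(labeled.iterator().next().get("numcrawled"));
        tx.success();
    }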

Clark Richey