
UPDATE: I have put up a follow-up question that contains updated scripts and a clearer setup on Neo4j performance compared to MySQL (how can it be improved?). Please continue there. /UPDATE

I have some problems verifying the performance claims made in the "Graph Databases" book (page 20) and in the "Neo4j in Action" book (chapter 1).

To verify these claims I created a sample dataset of 100,000 'person' entries with 50 'friends' each, and tried to query for e.g. friends 4 hops away. I loaded the very same dataset into MySQL. For friends of friends over 4 hops MySQL returns in 0.93 secs, while Neo4j needs 65-75 secs (on repeated calls).
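For scale: with a uniform fan-out of 50 friends per person, a 4-hop count has to enumerate on the order of 50^4 paths. A quick sanity check in plain Python, independent of either database:

```python
# Upper bound on the number of k-hop friend paths from a single person,
# assuming every person has exactly 50 outgoing 'friend' edges
# (as in the sample dataset).
FAN_OUT = 50

for hops in range(1, 5):
    print(hops, FAN_OUT ** hops)
# 4 hops -> 50**4 = 6,250,000 candidate paths to count
```

This is why the jump from 3 to 4 hops is so expensive: each extra hop multiplies the work by the fan-out.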

How can I improve this miserable outcome, and verify the claims made in the books?

A bit more detail:

I run the whole setup on an i5-3570K with 16GB RAM, using Ubuntu 12.04 64-bit, Java 1.7.0_25, MySQL 5.5.31 and neo4j-community-2.0.0-M03 (I get a similar outcome with 1.9).

All code/sample data can be found on https://github.com/jhb/neo4j-experiements/ (to be used with 2.0.0). The resulting sample data in different formats can be found on https://github.com/jhb/neo4j-testdata.

To use the scripts you need Python with mysql-python, requests and simplejson installed.

  • the dataset is created with friendsdata.py and stored to friends.pickle
  • friends.pickle gets imported to neo4j using import_friends_neo4j.py
  • friends.pickle gets imported to mysql using import_friends_mysql.py
  • I added indexes on t_user_friend.* in MySQL
  • I added "create index on :node(noscenda_name)" in Neo4j

To make life a bit easier the friends.*.bz2 contain sql and cypher statements to create those datasets in mysql and neo4j 2.0 M3.
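For reference, the generation step presumably boils down to something like the following (a minimal sketch, not the actual friendsdata.py; the reduced NUM_PERSONS is only to keep the sketch fast):

```python
import pickle
import random

NUM_PERSONS = 1000    # the real dataset uses 100000
FRIENDS_EACH = 50

persons = ['person%d' % i for i in range(NUM_PERSONS)]
# pick 50 distinct friends for every person (self-friendship excluded)
friends = {
    p: random.sample([q for q in persons if q != p], FRIENDS_EACH)
    for p in persons
}

# store the adjacency dict for the import scripts to pick up
with open('friends.pickle', 'wb') as f:
    pickle.dump(friends, f)
```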

MySQL performance

I first warm MySQL up by querying:

select count(distinct name) from t_user;
select count(distinct name) from t_user;

Then, for the real measurement I do

python query_friends_mysql.py 4 10

This creates the following SQL statement (with changing t_user.names):

select 
    count(*)
from
    t_user,
    t_user_friend as uf1, 
    t_user_friend as uf2, 
    t_user_friend as uf3, 
    t_user_friend as uf4
where
    t_user.name='person8601' and 
    t_user.id = uf1.user_1 and
    uf1.user_2 = uf2.user_1 and
    uf2.user_2 = uf3.user_1 and
    uf3.user_2 = uf4.user_1;

and repeats this 4-hop query 10 times. The queries need around 0.95 secs each. MySQL is configured to use a key_buffer of 4G.
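The timing side of query_friends_mysql.py can be assumed to be nothing more than a loop around the statement; a stand-in sketch (the no-op callable here is a placeholder for the real cursor.execute):

```python
import time

def time_calls(run_query, repeats=10):
    """Call run_query() `repeats` times and return per-call durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.time()
        run_query()
        durations.append(time.time() - start)
    return durations

# With MySQLdb this would be: time_calls(lambda: cursor.execute(sql))
durations = time_calls(lambda: sum(range(1000)))
print(len(durations))
```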

Neo4j performance testing

I have modified neo4j.properties:

neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=250M

and the neo4j-wrapper.conf:

wrapper.java.initmemory=2048
wrapper.java.maxmemory=8192

To warm up neo4j I do

start n=node(*) return count(n.noscenda_name);
start r=relationship(*) return count(r);

Then I start using the transactional HTTP endpoint (but I get the same results using the neo4j-shell).

Still warming up, I run

./bin/python query_friends_neo4j.py 3 10

This creates a query of the form (with varying person ids):

{"statement": "match n:node-[r*3..3]->m:node where n.noscenda_name={target} return count(r);", "parameters": {"target": "person3089"}}

After the 7th call or so, each call needs around 0.7-0.8 secs.
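For completeness, the POST body sent by query_friends_neo4j.py to the transactional endpoint can be built like this (a sketch; the /db/data/transaction/commit URL is the 2.0 default, and the actual requests call is shown only as a comment since it needs a running server):

```python
import json

# Build the payload for Neo4j's transactional endpoint.
statement = {
    "statement": "match n:node-[r*3..3]->m:node "
                 "where n.noscenda_name={target} return count(r);",
    "parameters": {"target": "person3089"},
}
payload = json.dumps({"statements": [statement]})
print(payload)

# With requests installed, the actual call would be roughly:
# requests.post("http://localhost:7474/db/data/transaction/commit",
#               data=payload, headers={"content-type": "application/json"})
```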

Now for the real thing (4 hops) I do

./bin/python query_friends_neo4j.py 4 10

creating

{"statement": "match n:node-[r*4..4]->m:node where n.noscenda_name={target} return count(r);", "parameters": {"target": "person3089"}}

and each call takes between 65 and 75 secs.

Open questions/thoughts

I'd really like to see the claims in the books be reproducible and correct, with Neo4j faster than MySQL instead of magnitudes slower.

But I don't know what I am doing wrong... :-(

So, my big hopes are:

  • I didn't do the memory settings for neo4j correctly
  • The query I use for neo4j is completely wrong

Any suggestions to get neo4j up to speed are highly welcome.

Thanks a lot,

Joerg

  • please note that the "Neo4j in Action" book used the embedded Java API, cypher is not optimized that much yet – Michael Hunger Jul 21 '13 at 20:42
  • Thanks for the hint. The 'Neo4j in Action' book actually uses the traversal API (and is hence comparing apples to oranges, imho). The graph databases book leaves that bit out completely. I doubt that I am able to use the traversal API though. Also, does the traversal API have a different scaling behaviour? In the example querying 3 hops took 0.7 secs, querying 4 hops took 60 secs. I would love to see an example using the traversal API on (my) published sample data. – Joerg Baach Jul 22 '13 at 07:32
  • Even with the traversal API, 2.0-M03 is slower traversal-wise than 1.9 since it was released before any kind of performance ensuring began. M04 will probably be closer to 1.9, but doing performance measurements on milestone releases means that all bets are off. – Mattias Finné Jul 23 '13 at 18:02

2 Answers


2.0 has not been performance optimized at all, so you should use 1.9.2 for comparison. (If you use 2.0: did you create an index for n.noscenda_name?)

You can check the query plan with profile start ....

With 1.9 please use a manual index or node_auto_index for noscenda_name.

Can you try these queries:

start n=node:node_auto_index(noscenda_name={target})
match n-->()-->()-->m
return count(*);

Fulltext indexes are also more expensive than exact indexes, so keep the exact auto-index for noscenda_name.
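For reference, enabling the exact auto-index on 1.9 means settings along these lines in conf/neo4j.properties (property names per the 1.9 configuration docs; the index only covers nodes created after it is enabled, so re-import afterwards):

```
node_auto_indexing=true
node_keys_indexable=noscenda_name
```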

I can't get your importer to run, it fails at some point; perhaps you can share the finished neo4j database:

python importer.py
reading rels
reading nodes
delete old
Traceback (most recent call last):
  File "importer.py", line 9, in <module>
    g.query('match n-[r]->m delete r;')
  File "/Users/mh/java/neo/neo4j-experiements/neo4jconnector.py", line 99, in query
    return self.call(payload)
  File "/Users/mh/java/neo/neo4j-experiements/neo4jconnector.py", line 71, in call
    self.transactionurl = result.headers['location']
  File "/Library/Python/2.7/site-packages/requests-1.2.3-py2.7.egg/requests/structures.py", line 77, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'location'
Michael Hunger
  • Thanks a lot for looking into this. First, the published scripts are working on the transactional http endpoint (of 2.0.0M3). I have uploaded a data dir for 1.9.2 to https://github.com/jhb/neo4j-testdata/blob/master/100000_friends_data.1.9.2.tar.bz2. I use indexes for 2.0 and 1.9.2 for looking up noscenda_name - this lookup time should be constant in any case, though. Querying 3 hops still only needs 0.7 secs (see original question). I don't think that the lookup of the start node plays a major role here. – Joerg Baach Jul 22 '13 at 07:40
  • "this lookup time should be constant in any case" - I meant that it's one lookup, and the time for it is the same whether the query is for 2, 3, 4 or * hops. – Joerg Baach Jul 22 '13 at 07:51
  • The 1.9.2 data I posted actually doesn't contain a node_auto_index. Will fix that. On the current 1.9.2 sample, running `start n=node(146803) match n-->()-->()-->()-->m return count(*);` still needs around 40 secs, while `start n=node(146803) match n-->()-->()-->m return count(*);` needs 0.5 secs. A factor of 80 between the two, while the graph databases book speaks of a factor of 7. And still no performance gain over mysql whatsoever. – Joerg Baach Jul 22 '13 at 08:07
  • the data is updated, contains a node_auto_index and conf data now – Joerg Baach Jul 22 '13 at 10:57

Just to add to what Michael said, in the book I believe the authors are referring to a comparison that was done in the Neo4j in Action book - it's described in the free first chapter of that book.

At the top of page 7 they explain that they were using the Traversal API rather than Cypher.

I think you'll struggle to get Cypher near that level of performance at the moment so if you want to do those types of queries you'll want to use the Traversal API directly and then perhaps wrap it in an unmanaged extension.

Mark Needham
  • Hi Mark, thanks for your reply. As answered above - the Neo4j in Action book actually seems to use the traversal API (which I overlooked). So it is comparing apples to oranges. At least in the graph databases book it's implied that the comparison is done using query languages... In any case - I am not a Java coder, so for me it's really hard to verify the claim about the traversal API. Does this really *scale* completely differently than cypher? That would mean that cypher was just a little toy. And the examples, slides and presentations I have seen from neo4j all emphasize the use of cypher... – Joerg Baach Jul 22 '13 at 07:49