
I'm seeing some rather improbable performance results from embedded Neo4j. On the surface it's orders of magnitude slower than expected, so I'm assuming I'm "doing it wrong", although I'm not doing anything complicated.

I'm using the latest embedded python bindings for Neo4j (https://github.com/neo4j/python-embedded)

from neo4j import GraphDatabase
db = GraphDatabase('/tmp/neo4j')

I've created 1500 fake products with simple attributes:

fake_products = [{'name':str(x)} for x in range(0,1500)]

... and created nodes out of them that I connected to a subreference node:

with db.transaction:
    products = db.node()
    db.reference_node.PRODUCTS(products)

    for prod_def in fake_products:
        product = db.node(name=prod_def['name'])        
        product.INSTANCE_OF(products)

Now, with what looks to me like almost exactly the same kind of code I've seen in the documentation:

PRODUCTS = db.getNodeById(1) 
for x in PRODUCTS.INSTANCE_OF.incoming: 
    pass

... iterating through these 1500 nodes takes >0.2s on my MacBook Pro. WHAT. (EDIT: I of course ran this query a bunch of times, so at least in the Python bindings it's not a matter of cold caches.)
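For reference, the way I timed the repeated runs looks roughly like this (a minimal harness; `traverse` is a hypothetical stand-in for the loop above, with a plain-Python body here so the snippet is self-contained - the real body would be the Neo4j traversal):

```python
import timeit

def traverse():
    # hypothetical stand-in for the Neo4j loop above:
    #   for x in PRODUCTS.INSTANCE_OF.incoming: pass
    # iterating a plain range here so the harness runs anywhere
    for x in range(1500):
        pass

# time several consecutive runs; warm-cache runs should get faster,
# but with the Python bindings they all take about the same time
times = [timeit.timeit(traverse, number=1) for _ in range(5)]
print(["%.5f s" % t for t in times])
```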

I amped it up to 15k nodes and it took 2s. I downloaded Gremlin and issued an equivalent query to investigate whether it's Neo4j or the Python bindings:

g.v(1).in("INSTANCE_OF")

... it took about 2s on the first try; on the second run it completed almost immediately.

Any idea why it's so slow? The results I'm getting have got to be some kind of mistake on my part.

Wojtek

1 Answer


This is Neo4j loading data lazily and not doing any prefetching. On the first run you are hitting the disk; on subsequent runs the caches are warm, which is your real production scenario.
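The cold-vs-warm pattern described here can be sketched as follows (a hedged illustration only; `run_query` is a hypothetical stand-in for the traversal - against a real store, the first call would pay the disk-read cost and later calls would hit the warm cache):

```python
import time

def run_query():
    # hypothetical stand-in for:
    #   for x in PRODUCTS.INSTANCE_OF.incoming: pass
    # a plain loop here so the sketch is self-contained
    total = 0
    for x in range(15000):
        total += x
    return total

def timed(fn):
    """Return the wall-clock duration of one call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

cold = timed(run_query)                          # first run: disk reads, cold caches
warm = min(timed(run_query) for _ in range(3))   # later runs: warm caches
print("cold=%.5f s  warm=%.5f s" % (cold, warm))
```

If the warm runs are not markedly faster than the cold one, the bottleneck is somewhere other than the page cache - which is what the questioner reports seeing in the Python bindings.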

Peter Neubauer
  • Thanks for the answer, Peter. I'm assuming the way I created & connected the data was OK? But regarding your answer: that's not what I'm seeing, in the Python bindings results at least - the above traversal, run multiple times in a row, takes the same amount of time every run. – Wojtek Feb 03 '12 at 15:27
  • So, in gremlin/groovy/java land it is fast, but not through Python? – Peter Neubauer Feb 04 '12 at 14:49
  • Argh, trying to install JPype without success on OSX Lion, want to reproduce it :/ – Peter Neubauer Feb 05 '12 at 16:06
  • Here's my report on the github page for neo4j/python-embedded and jakewins' reply: https://github.com/neo4j/python-embedded/issues/15 I think I followed the instructions here to get JPype to work: http://stackoverflow.com/questions/8525193/cannot-install-jpype-on-os-x-lion-to-use-with-neo4j – Wojtek Feb 05 '12 at 17:37
  • Sorry for the quite idiotic reply - I didn't realize pressing enter sends the comment, so it was sent before I could say more :) Anyway, thanks for your interest. Querying 15k nodes in gremlin/groovy is super fast once the caches are warm, so it's got to be the Python bindings. I'm really interested in using Neo4j (and Gremlin, something the Python bindings don't seem to allow). I'm now trying an approach where my Python web app connects to a ZeroMQ server socket hosted by Jython that calls a Groovy class to retrieve results. It might be a pain to maintain, but I see no other way right now. – Wojtek Feb 05 '12 at 17:47
  • Ok, we are very interested in your findings then, and maybe you could contribute the bindings so they can be maintained by the community later? – Peter Neubauer Feb 06 '12 at 23:09