Performance of Java API versus Python with Cypher for Neo4J

Question

I am working with an application that uses a Neo4J graph containing about 10 million nodes. One of the main tasks that I run daily is a batch import of new/updated nodes into the graph, on the order of 1-2 million per run. After experimenting with Python scripts in combination with the Cypher query language, I decided to give the embedded graph with the Java API a try in order to get better performance.

What I found is about a 5x improvement using the native Java API. I am using Neo4j 2.1.4, which I believe is the latest. I have read in other posts that the embedded graph is a bit faster, but that this should/could be changing in the near future. Has anyone observed similar results? I would like to validate my findings.

I have included snippets below just to give a general sense of methods used - code has been greatly simplified.

sample from cypher/python:

# Take the timestamp once so date_created and date_updated match exactly.
now = str(datetime.datetime.now())
cnode = self.graph_db.create(node(hash = obj.hash,
    name = obj.title,
    date_created = now,
    date_updated = now
))

sample from embedded graph using java:

// Must run inside a transaction in Neo4j 2.x (graphDb.beginTx() ... tx.success()).
final Node n = Graph.graphDb.createNode();
for (final Label label : labels){
    n.addLabel(label);
}
for (Map.Entry<String, Object> entry : properties.entrySet()) {
    n.setProperty(entry.getKey(), entry.getValue());
}

Thank you for your insight!

Nigel Small · Accepted Answer · 2014-09-22T18:56:35.913

What you're actually doing here is comparing the speeds of two different APIs and merely using two different languages to do that. Therefore, you're not comparing like for like. The Java core API and the REST API used by Python (and other languages) have different idioms, such as explicit vs implicit transactions. Additionally, network latency associated with the REST API will make a great difference, especially if you are using one HTTP call per node created.

So to get a more meaningful performance comparison, make sure you are comparing like for like: use Java via the REST API perhaps or use Cypher for both tests.

Hint 1: you will get better performance in general over REST by batching up a number of requests into a single API call.

Hint 2: the REST API will never be as fast as the core API as the latter is native and the former has many more layers to go through.
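To illustrate Hint 1, here is a minimal sketch of grouping node creations so that each HTTP request carries many statements instead of one. It only builds the request bodies and does not send them. It assumes the Neo4j 2.x transactional Cypher endpoint and the `{props}` parameter syntax of that version; the URL, the `Item` label, the batch size of 1000 and the helper names `chunked` and `build_payload` are all hypothetical choices for illustration, not something taken from the question.

```python
import json
from itertools import islice

# Assumption: default local Neo4j 2.x transactional endpoint; adjust as needed.
TX_COMMIT_URL = "http://localhost:7474/db/data/transaction/commit"


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_payload(batch, label="Item"):
    """Build one JSON request body that creates every node in `batch`.

    `label` is a hypothetical node label; `{props}` is the Neo4j 2.x
    syntax for a map parameter.
    """
    statement = "CREATE (n:%s {props})" % label
    return json.dumps({
        "statements": [
            {"statement": statement, "parameters": {"props": props}}
            for props in batch
        ]
    })


if __name__ == "__main__":
    # Made-up sample data standing in for the objects being imported.
    nodes = [{"hash": "h%d" % i, "name": "title %d" % i} for i in range(2500)]
    payloads = [build_payload(b) for b in chunked(nodes, 1000)]
    print(len(payloads))  # 3 request bodies instead of 2500 separate calls
```

Each body would then be POSTed to `TX_COMMIT_URL` with a `Content-Type: application/json` header. The right batch size depends on your data and server memory, so treat 1000 as a starting point to measure from rather than a recommendation.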

score 0 · Answer 2 · answered Sep 22 '14 at 15:01

Without proper performance measurements, it's hard to tell where the time goes. Generally, Python scripts are slower than Java, but the language is faster to write code in, so you trade development speed for execution speed.

For example: Your code above takes one hour to run in Python and 12 minutes in Java. Writing the Python version took you 1 day, the Java version took you 3 days. That means you need to run the code at least 2 days (2,880 minutes) / (60 - 12) minutes = 60 times to break even.

The example, of course, only makes sense as long as you can afford to wait the extra 48 minutes for Python to do its job. If your system is down for the duration of the import, then 60 vs 12 minutes makes a huge difference, unless you can run it during the night when no one cares.
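The break-even arithmetic above can be written out as a small sketch. The function name `break_even_runs` is made up for illustration, and the numbers are the hypothetical ones from the example, not measurements.

```python
def break_even_runs(extra_dev_minutes, slow_run_minutes, fast_run_minutes):
    """Number of runs after which the faster version has repaid its extra
    development time."""
    saved_per_run = slow_run_minutes - fast_run_minutes
    if saved_per_run <= 0:
        raise ValueError("the 'fast' version must actually be faster")
    return extra_dev_minutes / saved_per_run


# 3 days for Java minus 1 day for Python = 2 extra days,
# counted here as full 24-hour days, as in the example.
extra_dev = 2 * 24 * 60  # 2880 minutes
print(break_even_runs(extra_dev, 60, 12))  # 60.0
```

Note that the result depends on how a "day" is counted: with 8-hour working days the extra effort is 960 minutes and the break-even point drops to 20 runs.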

score 0 · Answer 3 · answered Sep 22 '14 at 15:04

If you play "the benchmarking game" with Java versus Python 3 (http://benchmarksgame.alioth.debian.org/u32/benchmark.php?test=all&lang=java&lang2=python3&data=u32), a 5-fold improvement for the Java version is certainly plausible.

answered Sep 22 '14 at 15:04

Stephen C
