
I have been really struggling to achieve acceptable performance for my application with Neo4J 3.0.3. Here is some background:

I am trying to replace Apache Solr with Neo4j for an application to extend its capabilities, while maintaining or improving performance.

In Solr I have documents that essentially look like this:

{
    "time": "2015-08-05T00:16:00Z",
    "point": "45.8300018311,-129.759994507",
    "sea_water_temperature": 18.49,
    "sea_water_temperature_depth": 4,
    "wind_speed": 6.48144,
    "eastward_wind": 5.567876,
    "northward_wind": -3.3178043,
    "wind_depth": -15,
    "sea_water_salinity": 32.19,
    "sea_water_salinity_depth": 4,
    "platform": 1,
    "mission": 1,
    "metadata": "KTDQ_20150805v20001_0016"
}

Since each Solr document is essentially a flat set of key-value pairs, my initial translation to Neo4j was going to be simple, so I could get a feel for working with the API.

My method was essentially to have each Solr record equate to a Neo4J node, where every key-value would become a node-property.

Obviously a few tweaks were required: changing None to 'None' (Python), converting ISO times to epoch times (Neo4j doesn't support indexing datetimes), splitting point into lat/lon (for Neo4j spatial indexing), and so on.
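For example, my preprocessing looks roughly like this (a sketch; to_neo4j_props is my own helper, and the field names match my Solr documents):

import calendar
from datetime import datetime

def to_neo4j_props(doc):
    # Convert one Solr document into a flat dict of Neo4j-safe properties.
    props = dict(doc)
    for key, value in props.items():
        if value is None:
            props[key] = 'None'  # Neo4j properties cannot hold nulls
    # ISO 8601 -> epoch seconds, since Neo4j cannot index datetimes directly
    t = datetime.strptime(doc['time'], '%Y-%m-%dT%H:%M:%SZ')
    props['time'] = calendar.timegm(t.timetuple())
    # "lat,lon" string -> separate floats for the spatial index
    lat, lon = doc['point'].split(',')
    props['lat'], props['lon'] = float(lat), float(lon)
    del props['point']
    return props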

My goal was to load up Neo4j using this model, regardless of how naive it might be.

Here is an example of a REST call I make when loading in a single record (using http://localhost:7474/db/data/cypher as my endpoint):

{
    "query": "CREATE (r:record {lat : {lat}, SST : {SST}, meta : {meta}, lon : {lon}, time : {time}}) RETURN id(r);",
    "params": {
        "lat": 40.1021614075,
        "SST": 6.521100044250488,
        "meta": "KCEJ_20140418v20001_1430",
        "lon": -70.8780212402,
        "time": 1397883480
    }
}

Note that I have actually removed quite a few parameters for testing neo4j.
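For completeness, the call itself is just an HTTP POST of that JSON body (a minimal sketch using the requests library; my local server has auth disabled):

import requests

payload = {
    "query": "CREATE (r:record {lat : {lat}, SST : {SST}, meta : {meta}, "
             "lon : {lon}, time : {time}}) RETURN id(r);",
    "params": {"lat": 40.1021614075, "SST": 6.521100044250488,
               "meta": "KCEJ_20140418v20001_1430",
               "lon": -70.8780212402, "time": 1397883480}
}
resp = requests.post("http://localhost:7474/db/data/cypher", json=payload)
resp.raise_for_status()
print(resp.json()["data"])  # e.g. [[12345]] -- the id of the new node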

Currently I have serious performance issues. Loading a document like this into Solr takes me about 2 seconds. For Neo4j it takes:

- ~20 seconds using the REST API

- ~45 seconds using Bolt

- ~70 seconds using py2neo

I have ~50,000,000 records to load. Doing this in Solr usually takes 24 hours, so Neo4j could take almost a month!

I recorded these times without a uniqueness constraint on my 'meta' attribute and without adding each node to the spatial index; with those enabled, the load times were far worse.

Running into this issue, I tried searching for performance tweaks online. The following things have not improved my situation:

- increasing the open file limit from 1024 to 40000

- using ext4, and tweaking it as documented here

- increasing the page cache size to 16 GB (my system has 32 GB)
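(For reference, the page cache setting is the dbms.memory.pagecache.size line in conf/neo4j.conf:)

# conf/neo4j.conf
dbms.memory.pagecache.size=16g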

So far I have only addressed load times. After I had loaded about 50,000 nodes overnight, I attempted queries on my spatial index like so:

CALL spatial.withinDistance('my_layer', {lon: 34.0, lat: 20.0}, 1000)

as well as on my time index like so:

MATCH (r:record) WHERE r.time > {start} AND r.time < {end} RETURN r;

These simple queries would take literally several minutes just to return possibly a few nodes.
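For example, here is roughly how I issue the time query through the Bolt driver (a sketch; the start/end epoch values are just an example window):

from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687')
session = driver.session()
result = session.run(
    "MATCH (r:record) WHERE r.time > {start} AND r.time < {end} RETURN r",
    {"start": 1438732800, "end": 1438819200})  # roughly one day's window
for record in result:
    print(record["r"])
session.close()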

In Apache Solr, the spatial index is extremely fast and responds within 5 seconds (even with all 50,000,000 docs loaded).

At this point, I am concerned as to whether this performance lag is due to the nature of my data model, the configuration of my server, or something else.

My goal was to extrapolate from this model, move several measurement types to their own class of node, and create relationships from my base record node to these.

Is it possible that I am abusing Neo4j, and need to recreate this model to use relationships and several different Node types? Should I expect to see dramatic improvements?

As a side note, I originally planned to use a triple store (specifically Parliament) to store this data, and after struggling to work with RDF, decided that Neo4j looked promising and much easier to get up and running. Would it be worthwhile to go back to RDF?

Any advice, tips, or comments are welcome. Thank you in advance.

EDIT:

As suggested in the comments, I have changed the behavior of my loading script.

Previously I was using python in this manner:

from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687')  # the neo4j.v1 driver speaks Bolt, not REST
session = driver.session()
for tuple in mydata:
    statement = build_statement(tuple)
    session.run(statement)  # queued, but not yet flushed to the server
session.close()             # all of the work happens here

With this approach, the actual .run() calls complete in virtually no time; the .close() call is where all the run time occurs.

My modified approach:

transaction = ''
for tuple in mydata:
    statement = build_statement(tuple)
    transaction += ('\n' + statement)  # concatenate every CREATE into one string
with session.begin_transaction() as tx:
    tx.run(transaction)                # submitted as a single statement
session.close()

I'm a bit confused because the behavior is pretty much the same: .close() still takes around 45 seconds, except now it doesn't commit. Since I am reusing the same identifier in each of my statements (CREATE (r:record {...}) ... CREATE (r:record {...}) ...), I get a CypherError about this. I don't really know how to avoid this problem at the moment, and furthermore the run time did not seem to improve at all (I would expect an error to make this terminate much faster).
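One approach I am considering but have not yet tried: send a single parameterized statement and push the records in as a list with UNWIND, so the identifier r is only declared once per batch (a sketch; build_params stands in for my record-to-dict conversion):

from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687')
session = driver.session()

statement = ("UNWIND {rows} AS row "
             "CREATE (r:record {lat: row.lat, SST: row.SST, meta: row.meta, "
             "lon: row.lon, time: row.time})")

rows = [build_params(t) for t in mydata]  # one property dict per record
for i in range(0, len(rows), 10000):      # ~10k operations per transaction
    with session.begin_transaction() as tx:
        tx.run(statement, {"rows": rows[i:i + 10000]})
        tx.success = True                 # mark the transaction for commit
session.close()

If I understand correctly, this would also let me batch ~10k records per transaction, as suggested in the comments.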

spanishgum
  • What indexes are you using (what are the results of `:schema` in the browser)? Note that for insertion you should combine many `CREATE` statements into a single transaction to improve performance. You can perform ~10k database operations per transaction. For large initial bulk loading consider the [neo4j-import tool](http://neo4j.com/docs/operations-manual/current/#import-tool) – William Lyon Jul 12 '16 at 02:33
  • The result of schema gives me `Indexes ON :record(time) ONLINE ON :record(meta) ONLINE (for uniqueness constraint) Constraints ON (record:record) ASSERT record.meta IS UNIQUE`... I haven't tried the bulk tool because I did not want to create so many intermediate files during processing. I suppose I can give it a shot if all else fails. I'm going to try joining the create statements as you suggested. – spanishgum Jul 12 '16 at 17:10
  • Ok, I tried joining the statements, but I'm a bit stuck. I am editing my Q above with a follow up. – spanishgum Jul 12 '16 at 17:47

0 Answers