
In my project I am using spring-data-neo4j 4.2.0.M1 with neo4j-ogm 2.0.4. Initially this used an embedded Neo4j instance, but in the course of investigating this issue I have migrated to a dedicated Neo4j instance (running on the same machine, though), accessed via the Bolt protocol.

I am continuously inserting data, basically as it becomes available to my application (so I can't use batch inserts). After startup this works fine and saving an instance of my NodeEntity takes ~60 ms, which is perfectly fine for my use case. However, this slowly degrades over time: after 10-20 minutes it slows down to about 2 s per save, which is not so great anymore. The save time seems to peak there and doesn't change much afterwards.

Initially I assumed that this was caused by the embedded instance being too small, since Neo4j repeatedly reported GC pauses. I then migrated to a dedicated instance that is much bigger, and those GC warnings no longer show up. The degradation still occurs, though.

Store sizes as reported by neo4j:

Array Store 8.00 KiB
Logical Log 151.36 MiB
Node Store 40.14 MiB
Property Store 1.83 GiB
Relationship Store 742.63 MiB
String Store Size 120.87 MiB
Total Store Size 4.55 GiB

The instance is configured as follows:

dbms.memory.pagecache.size=5g
dbms.memory.heap.initial_size=4g
dbms.memory.heap.max_size=4g
dbms.jvm.additional=-XX:+UseG1GC

Using YourKit profiler (sampler mode!) I can see that most of the time seems to be spent by neo4j-ogm's EntityGraphMapper, specifically in

org.neo4j.ogm.context.EntityGraphMapper#haveRelationEndsChanged

[YourKit profiler screenshot]

The NodeEntity that is being saved usually has about ~40 relationships to other nodes, most of them modeled as RelationshipEntity. In an earlier phase I had already noticed that saving the entities was quite slow because too many related (but unchanged) entities were mapped as well. Since then I am using a depth of 1 when saving. The continuous operation that causes the NodeEntities to be saved uses a transaction size of 200 entities, roughly as sketched below.
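For illustration, a minimal sketch of that save loop, assuming the setup described above (the entity and variable names are hypothetical; only `session.save(entity, 1)` and the batch size of 200 come from the description):

    import org.neo4j.ogm.session.Session;
    import org.neo4j.ogm.transaction.Transaction;

    // Hypothetical sketch: save each incoming entity with depth 1,
    // committing 200 entities per transaction.
    void saveBatch(Session session, Iterable<MyNodeEntity> batchOf200) {
        try (Transaction tx = session.beginTransaction()) {
            for (MyNodeEntity entity : batchOf200) {
                session.save(entity, 1); // depth 1: the node plus its direct relationships
            }
            tx.commit();
        }
    }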

I am not convinced yet that neo4j-ogm is actually the cause of the slowdown, since I don't see what changes compared to the good initial results. In cases like this I usually suspect memory leaks/pollution, but all the monitoring results look good on my application's side. For the Neo4j server instance I don't really know where to look for such information, apart from the debug.log.

All in all I've spent quite some time investigating this already and don't know what else to look at. Any thoughts or suggestions? I am happy to provide additional information.

Edit: Following @Vince's input, I've had another look at the memory distribution and found that the Neo4jSession had in fact grown quite a lot after letting the application run for ~3 h:

[neo4j-ogm memory profiler screenshot]

At that time the heap was 1.7 GB in size, of which 70% referenced live data. Out of that, about 300 MB were currently referenced (and kept alive) by the Neo4jSession. This may indicate that it has grown too big. How can I manually intervene here?

geld0r
  • Are you creating a new session for each transaction (batch of 200 entities), or using a single session? – Vince Nov 08 '16 at 21:00
  • I am using the same session (I think). I don't have any manual handling for sessions and also use the default scope. From what I understood from the documentation, this should be beneficial performance-wise for longer-running operations? I don't expect any updates outside of my worker thread in the meantime. – geld0r Nov 08 '16 at 21:55
  • 2
    Entities stick around in the session until they get garbage collected. There may be some performance impact in `haveRelationEndsChanged` if you're loading many thousands of entities, so it may be worth doing `session.clear()` between each transaction and see if this helps. – Vince Nov 08 '16 at 22:35
  • @Vince: Great suggestion, this seems to have done the trick! The same test that earlier ran for 3 hours now just took 40 minutes and had a constant time for all database inserts. Since this is now faster than requests come in, the problem is solved. I'll try to increase the transaction size though as I guess that 200 is rather too small. Please add your suggestion as an answer and I'll accept it. – geld0r Nov 09 '16 at 20:55

3 Answers


Hope it is not too late to help with this issue.

I've faced the same situation recently when saving a node with ~900 relationships within a Set, and got the save down from ~5 seconds to ~500 ms. I was initially using neo4j-ogm 2.1.3 and have just migrated to 3.0.0. Even though 3.0.0 is much faster, the performance gain was similar across the two versions.

Here's some pseudo-code (I cannot share the real code now):

@NodeEntity(label = "MyNode")
public class MyNode {
    @GraphId
    private Long id;

    @Index(unique = true, primary = true)
    private String myUniqueValue;

    private String value;

    @Relationship(type = "CONNECTS_TO")
    private Set<MyRelationship> relationships;
    // constructors, getters, setters
}

@RelationshipEntity(type = "CONNECTS_TO")
public class MyRelationship {

    @GraphId
    private Long id;

    @StartNode
    private MyNode parent;

    @EndNode
    private MyNode child;
    // constructors, getters, setters
}

Notice that MyNode has an indexed/unique field where I have full control over the value. neo4j-ogm will use it to determine whether it should execute a CREATE or MERGE statement. In my use case, I want the merge to happen if the node already exists.

Relationship creation, on the other hand, relies on the node id (@GraphId field). Here's a small snippet of the statement generated that creates it:

UNWIND {rows} as row MATCH (startNode) WHERE ID(startNode) = row.startNodeId MATCH (endNode) WHERE ID(endNode) = row.endNodeId...

In the slow mode, neo4j-ogm will take care of verifying whether the relationship or the nodes within it are already saved, and will retrieve the ids necessary to create the relationship. This is the operation that you captured in YourKit.

An example that executes slowly:

void slowMode() {
    MyNode parent = new MyNode("indexed-and-unique", "some value");
    for (int j = 0; j < 900; j++) {
        MyNode child = new MyNode("indexed-and-unique" + j, "child value" + j);
        parent.addRelationship(new MyRelationship(parent, child));
    }
    session.save(parent); // save everything. slow.
}

The solution I've found was to break these operations into three parts:

  • Save the parent node only

  • Save the child nodes

  • Save the relationships

This is much faster:

void fastMode() {
    MyNode parent = new MyNode("indexed-and-unique", "some value");
    for (int j = 0; j < 900; j++) {
        MyNode child = new MyNode("indexed-and-unique" + j, "child value" + j);
        parent.addRelationship(new MyRelationship(parent, child));
    }
    session.save(parent, 0); // save only the parent
    session.save(getAllChildsFrom(parent), 0); // save all 900 children
    // At this point, all instances of MyNode will contain an id. Time to save the relationships!
    session.save(parent);
}

One thing to pay attention to: neo4j-ogm 2.1.3 did not execute a single batched statement when saving the collection of nodes (session.save(getAllChildsFrom(parent), 0)), which is still chatty and slow, but not as slow as before. Version 3.0.0 fixes that.

Hope it helps!


Entities stick around in the session until they get garbage collected. There may be some performance impact in `haveRelationEndsChanged` if you're loading many thousands of entities, so it may be worth doing `session.clear()` between each transaction to see if this helps.
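For illustration, a minimal sketch of that pattern, assuming the batched save loop from the question (the entity and variable names are hypothetical; `session.save()` and `session.clear()` are the calls named above):

    import org.neo4j.ogm.session.Session;
    import org.neo4j.ogm.transaction.Transaction;

    // Hypothetical sketch: commit a batch, then clear the session's mapping
    // context so previously saved entities no longer participate in dirty
    // checking (e.g. haveRelationEndsChanged) on subsequent saves.
    void saveBatchAndClear(Session session, Iterable<MyNodeEntity> batch) {
        try (Transaction tx = session.beginTransaction()) {
            for (MyNodeEntity entity : batch) {
                session.save(entity, 1);
            }
            tx.commit();
        }
        session.clear(); // detach everything tracked by this session
    }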

Vince

Some time ago we had practically the same situation, when we needed to store a large amount of data in Neo4j. We analyzed different approaches to handling this and found some solutions for speeding up inserts into Neo4j.

  1. Use the native Neo4j Java driver instead of spring-data. Among other things it offers an async API, and if immediate availability of the data for reads is not critical, it can help.

  2. Use transactions for inserting a number of records (for example, 1000 inserts per transaction); see the sketch below. This speeds up insertion because after every transaction commit Neo4j recalculates its Lucene-backed indexes, which takes time. In your case (using spring-data) every insert is executed in a separate transaction.
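For illustration, a minimal sketch combining both suggestions, using the 1.x Java driver API that was current at the time (the URL, credentials, label, and variable names are placeholders):

    import org.neo4j.driver.v1.AuthTokens;
    import org.neo4j.driver.v1.Driver;
    import org.neo4j.driver.v1.GraphDatabase;
    import org.neo4j.driver.v1.Session;
    import org.neo4j.driver.v1.Transaction;
    import org.neo4j.driver.v1.Values;

    // Hypothetical sketch: batch ~1000 inserts per transaction with the
    // native driver instead of going through spring-data-neo4j.
    void bulkInsert(Iterable<String> values) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            int count = 0;
            Transaction tx = session.beginTransaction();
            try {
                for (String value : values) {
                    tx.run("CREATE (n:MyNode {value: {value}})",
                            Values.parameters("value", value));
                    if (++count % 1000 == 0) { // commit every 1000 inserts
                        tx.success();
                        tx.close();
                        tx = session.beginTransaction();
                    }
                }
                tx.success(); // commit the remainder
            } finally {
                tx.close();
            }
        }
    }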

D. Krauchanka
  • I would really like to avoid writing manual queries for this diverse import task; spring-data-neo4j/neo4j-ogm simplify this quite a bit. I'll keep that suggestion in mind, though, for similar cases where the updates to be processed are more uniform. – geld0r Nov 09 '16 at 20:58