
I'm currently exploring the potential of graph databases for some processes in my industry. I started with Neo4jClient a week ago, so I'm below the standard beginner level :-)

I'm very excited about Neo4j, but I'm facing huge performance issues and I need help.

The first step in my project is to populate Neo4j from existing text files. Those files are composed of lines formatted using a simple pattern:

StringID=StringLabel(String1,String2,...,StringN);

For example, if I consider the following line:

#126=TYPE1(#80,#125);

I would like to create one node with label "TYPE1" and 2 properties:

1. a unique ID using the ObjectID: "#126" in the above example
2. a string containing all parameters for future use: "#80,#125" in the above example

I must also handle multiple forward references, as in the example below:

#153=TYPE22('0BTBFw6f90Nfh9rP1dl_3P',#144,#6289,$);

The line defining the node with StringID "#6289" will be parsed later in the file.

So, to solve my file import problem, I've defined the following class:

public class myEntity
{
    public string propID { get; set; }    // unique ObjectID, e.g. "#126"
    public string propATTR { get; set; }  // raw parameter string, e.g. "#80,#125"
}
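
For completeness, the splitting of each line into strID, strLABEL and strATTRIBUTES is done with a simple regular expression; a minimal illustrative sketch (not my exact parsing code):

using System.Text.RegularExpressions;

// Illustrative only: split one line into strID, strLABEL and strATTRIBUTES.
// Matches lines of the form: #126=TYPE1(#80,#125);
var linePattern = new Regex(@"^(?<id>#\d+)=(?<label>\w+)\((?<attr>.*)\);$");

var match = linePattern.Match(line.Trim());
if (match.Success)
{
    string strID         = match.Groups["id"].Value;     // e.g. "#126"
    string strLABEL      = match.Groups["label"].Value;  // e.g. "TYPE1"
    string strATTRIBUTES = match.Groups["attr"].Value;   // e.g. "#80,#125"
    // ... the node is then created as shown in the first step below ...
}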

And because of the forward references in my text file (and, no doubt, my poor Neo4j knowledge...), I've decided to work in 3 steps:

In a first loop, I extract strLABEL, strID and strATTRIBUTES from each line parsed from my file, then I add one Neo4j node for each line using the following code:

strLabel = "(entity:" + strLABEL + " { propID: {newEntity}.propID })";
graphClient.Cypher
    .Merge(strLabel)
    .OnCreate()
    .Set("entity = {newEntity}")
    .WithParams(new { 
        newEntity = new {
            propID = strID,
            propATTR = strATTRIBUTES
        }
    })
    .ExecuteWithoutResults();

Then I match all the nodes created in Neo4j using the following code:

// the query is only executed later, when queryNode.Results is enumerated
var queryNode = graphClient.Cypher
    .Match("(nodes)")
    .Return(nodes => new {
        NodeEntity = nodes.As<myEntity>(),
        Labels = nodes.Labels()
    });

And finally I loop over all the nodes, split the propATTR property of each node, and add one relation for each ObjectID found in propATTR, using the following code:

graphClient.Cypher
    .Match("(myEnt1)", "(myEnt2)")
    .Where((myEntity myEnt1) => myEnt1.propID == strID)
    .AndWhere((myEntity myEnt2) => myEnt2.propID == matchAttr)
    .CreateUnique("(myEnt1)-[:INTOUCHWITH]->(myEnt2)")
    .ExecuteWithoutResults();
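
In outline, that statement sits inside a loop like the following (a simplified sketch; the way non-ObjectID tokens are filtered out is only illustrative):

// Simplified sketch of step 3: for every node returned in step 2, split
// propATTR and create one INTOUCHWITH relation per referenced ObjectID.
foreach (var node in queryNode.Results)
{
    var strID = node.NodeEntity.propID;

    foreach (var token in node.NodeEntity.propATTR.Split(','))
    {
        var matchAttr = token.Trim();
        if (!matchAttr.StartsWith("#"))
            continue; // skip literals such as '0BTBFw6f90Nfh9rP1dl_3P' or $

        graphClient.Cypher
            .Match("(myEnt1)", "(myEnt2)")
            .Where((myEntity myEnt1) => myEnt1.propID == strID)
            .AndWhere((myEntity myEnt2) => myEnt2.propID == matchAttr)
            .CreateUnique("(myEnt1)-[:INTOUCHWITH]->(myEnt2)")
            .ExecuteWithoutResults();
    }
}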

When I explore the database populated by that code using Cypher, the resulting nodes and relations are the right ones, and Neo4j execution speed is very fast for every query I've tested. It's very impressive, and I'm convinced there is huge potential for Neo4j in my industry.

But my big issue today is the time required to populate the database (my config: Win8 x64, 32 GB RAM, SSD, Intel Core i7-3840QM 2.8 GHz):

For a small test case (6400 lines) it took 13 s to create 6373 nodes, and 94 s more to create 7800 relations.

On a real test case (40000 lines) it took 496 s to create 38898 nodes, and 3701 s more to create 89532 relations (yes: more than one hour!).

I have no doubt such poor performance results directly from my poor Neo4jClient knowledge.

It would be a tremendous help for me if the community could advise me on how to solve that bottleneck.

Thanks in advance for your help.

Best regards, Max

  • How often do you import data? Usually that's a one-time step (or few-time), and is not what people usually focus on, when thinking about performance. If you imported 89K relations in 3.7K sec, isn't that almost 25/second? And you're creating over 75 nodes per second on import too, right? Is that really that slow? And will you really need to repeat this process often? The other thing is, Neo4j has a built-in importer (using CSV). Have you tried with that, instead of node-by-node import via code? By the way: Your match code appears to not use label filtering. – David Makogon Aug 31 '14 at 15:51
  • First, thanks a lot for your answer. Depending on the product lifecycle step, importing data can happen several times a day (design phase) or be a one-time step (operation phase, after handover). In the current code, I use node labels (and there can be almost 200 distinct labels) to categorize them: is that a good choice, or should I add a string Category property to each node? I will try the CSV import and keep you informed of the results. Finally, I didn't use label filtering during the import phase because I need to match all nodes to add rels between them. – max rebar Sep 03 '14 at 04:43

1 Answer


While I don't have the exact syntax in my head to write down for you, I would suggest you look at splitting your propATTR values when you read them initially, and storing them directly as an array/collection property in Neo4j. This would hopefully then enable you to do your relationship creation in bulk within Neo4j, rather than iterating the nodes externally and executing so many sequential transactions.
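
For the first part, that could mean passing the already-split ObjectIDs as a collection parameter when each node is merged, roughly along these lines (an untested sketch reusing the names from your question; requires System.Linq):

// Sketch: store the referenced ObjectIDs as an array property instead of a
// single comma-separated string, so Cypher can work on it directly later.
var attrIds = strATTRIBUTES
    .Split(',')
    .Select(s => s.Trim())
    .Where(s => s.StartsWith("#"))
    .ToArray();

graphClient.Cypher
    .Merge("(entity:" + strLABEL + " { propID: {propID} })")
    .OnCreate()
    .Set("entity.propATTR = {attrIds}")
    .WithParams(new { propID = strID, attrIds })
    .ExecuteWithoutResults();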

The latter part might look something like:

MATCH (myEnt1),(myEnt2) WHERE myEnt2.propID IN myEnt1.propATTR
CREATE UNIQUE (myEnt1)-[:INTOUCHWITH]->(myEnt2)

Sorry, my Cypher is a bit rusty, but the point is to transfer the load fully into the Neo4j engine, rather than making continual round trips between your application logic and the Neo4j server. I suggest that it's probably these round trips that are killing your performance, and not so much the individual work involved in each transaction, so minimising the number of transactions would be the way to go.
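
In Neo4jClient that single bulk statement might look roughly like this (again untested, just to show the shape of one round trip instead of one per relation):

// One round trip: let the Neo4j engine create every INTOUCHWITH relation by
// matching propID values against the propATTR arrays stored on the nodes.
graphClient.Cypher
    .Match("(myEnt1)", "(myEnt2)")
    .Where("myEnt2.propID IN myEnt1.propATTR")
    .CreateUnique("(myEnt1)-[:INTOUCHWITH]->(myEnt2)")
    .ExecuteWithoutResults();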

Pat