I am new to Cypher. I am building a simple graph using GrapheneDB and py2neo (version 2.0.2).

In my simple graph I have Repository, Organization, and People nodes, connected by two types of relationships: IN_ORGANIZATION and IS_ACTOR. Below is the code snippet that creates the nodes and relationships (the entire script is on GitHub; refer to lines 88 - 108):

    #Create repository node if one does not exist
    r = graph.merge_one("Repository", "id", record["full_name"])
    #Update timestamp with time now in epoch milliseconds
    r.properties["created_at"] = MyMoment.TNEM()
    #Apply property change
    r.push()   
    ...
    #Create organization node if one does not exist
    o = graph.merge_one("Organization", "id", record["organization"])
    #Update timestamp with time now in epoch milliseconds      
    o.properties["created_at"] = MyMoment.TNEM()
    #Apply property change
    o.push()
    rel = Relationship(r,"IN_ORGANIZATION",o)
    #create unique relation between repository and organization
    #ignore if relation already exists
    graph.create_unique(rel)
    ...
    #Create people node if one does not exist
    p = graph.merge_one("People", "id", al)
    #Update timestamp with time now in epoch milliseconds          
    p.properties["created_at"] = MyMoment.TNEM()
    p.push()
    rel = Relationship(r,"IS_ACTOR",p)
    #create unique relation between repository and people
    #ignore if relation already exists
    graph.create_unique(rel)

The code above works very well on a small data set. But when the data set grows to the point where ~20K nodes and ~15K relationships are created/merged each hour, processing takes longer than an hour (sometimes several hours). I need to reduce the processing time. What alternatives can I explore? I was thinking of batch mode. How can I use it with merge_one and create_unique? Any ideas?
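One batching option in py2neo 2.x is its transactional Cypher interface (`graph.cypher.begin()` / `tx.append()` / `tx.commit()`), which lets you send many MERGE statements per commit instead of one HTTP round trip per `merge_one()` and `push()` call. The sketch below is a hypothetical outline, not the original script: `graph` is assumed to be an already-connected `py2neo.Graph`, `batch_merge_repositories` and its record shape are made up for illustration, and the batch size of 1000 is an arbitrary starting point.

```python
# Hypothetical sketch (assumes py2neo 2.x and a connected Graph named `graph`).
# MERGE finds-or-creates the node and sets the timestamp in one statement,
# replacing the separate merge_one() + push() round trips.
MERGE_REPO = (
    "MERGE (r:Repository {id: {id}}) "
    "SET r.created_at = {ts}"
)

def chunks(seq, size):
    """Yield successive slices of seq holding at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def batch_merge_repositories(graph, records, batch_size=1000):
    """records: iterable of (full_name, epoch_millis) pairs."""
    for batch in chunks(list(records), batch_size):
        tx = graph.cypher.begin()   # one transaction per batch
        for full_name, ts in batch:
            tx.append(MERGE_REPO, {"id": full_name, "ts": ts})
        tx.commit()                 # sends and commits the whole batch
```

The same pattern would extend to the Organization and People merges and the relationship MERGEs; the point is simply fewer, larger requests rather than thousands of tiny ones.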

harishvc
  • Hello, I'm Alberto, one of the founders at GrapheneDB. We've found LOAD CSV to be a pretty good way of loading data directly into your instance. Where are you running this script from: your local machine or a server/app server on AWS? Please consider the network latency. Also, you don't seem to be using any type of transactions. I would recommend opening a transaction, making thousands of operations, then committing the transaction, to reduce I/O wait. Also, I don't know the amount of data you're trying to load, but you might want to adjust the heap/cache settings too. – albertoperdomo Mar 05 '15 at 09:24
  • Also forgot to ask: do you have label indexes on the properties used in the MERGE clauses to locate the nodes? This would improve the time it takes to find the node in the graph. If not present, Neo4j is forced to do a label scan (iterating through all the nodes with that particular label), so the time will increase as the number of inserted nodes grows. – albertoperdomo Mar 05 '15 at 09:30
  • @albertoperdomo thanks! `LOAD CSV` made the difference. My data is now processed in less than a couple of minutes (based on early testing). I have also included a `uniqueness constraint` in the schema. – harishvc Mar 08 '15 at 06:50
  • Glad you were able to solve the issue. Another approach would be using the transaction endpoint and grouping n operations per transaction, but if this is over the wire there will always be a certain impact on performance. For best results use LOAD CSV, or run the import script locally and then upload the store file to the remote instance. – albertoperdomo Mar 09 '15 at 09:05
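For reference, the two fixes discussed in the comments (uniqueness constraints so MERGE can use an index, and LOAD CSV for bulk import) could be sketched as below. This is a hypothetical outline, not the asker's actual code: the schema calls are the py2neo 2.x API, `graph` is assumed to be a connected `py2neo.Graph`, and the CSV column names (`full_name`, `created_at`) are assumptions about the file layout.

```python
# Hypothetical sketch (assumes py2neo 2.x and a connected Graph named `graph`).

def ensure_constraints(graph):
    """Create uniqueness constraints once, before any import runs.
    A uniqueness constraint also gives MERGE an index for looking up
    the `id` property, avoiding a label scan per merged node."""
    for label in ("Repository", "Organization", "People"):
        if "id" not in graph.schema.get_uniqueness_constraints(label):
            graph.schema.create_uniqueness_constraint(label, "id")

# LOAD CSV statement; the URL is passed as a parameter and the column
# names (full_name, created_at) are assumptions about the CSV layout.
LOAD_REPOS = (
    "USING PERIODIC COMMIT 1000 "
    "LOAD CSV WITH HEADERS FROM {url} AS row "
    "MERGE (r:Repository {id: row.full_name}) "
    "SET r.created_at = toInt(row.created_at)"
)

def load_repositories(graph, csv_url):
    # USING PERIODIC COMMIT must run outside an explicit transaction;
    # if the driver wraps this call in one, drop that clause.
    graph.cypher.execute(LOAD_REPOS, {"url": csv_url})
```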

0 Answers