
I have 473639 nodes and 995863 parent->child relations in a MySQL table.

I am using both normal and batch operations to fetch the data and create the nodes and relationships, but both types of operation are slow. Is there any workaround to make this process faster?

The code is given below:

import MySQLdb as my
from py2neo import neo4j, node, rel

def conn(query):
    db = my.connect(host='localhost',
                    user='root',
                    passwd='root',
                    db='localdb')
    cur = db.cursor()
    cur.execute(query)
    return db, cur

query = 'select * from table1'
db, cur = conn(query)
d = dict()

graph = neo4j.GraphDatabaseService()
batch = neo4j.WriteBatch(graph)


def create_node(a):
    if a not in d:
        try:
            # graph.create returns a tuple of created entities, so unpack it
            A, = graph.create(node(name=str(a)))

            # for batch operation
            #A = batch.create(node(name=str(a)))

            d[a] = A
        except Exception, e:
            print e
    else:
        A = d[a]
    return A

rels = []  # collect (A, B) node pairs for relationship creation

# create node

for row in cur.fetchall():
    a,b = get_cat(row[0]), get_cat(row[1])
    try:
        A, B = create_node(a), create_node(b)
        rels.append((A,B))
    except Exception, e:
        print e


#create relations

for item in rels:
    a = item[0]
    b = item[1]
    graph.create(rel(a,"is parent of",b))

    # for batch operation
    #batch.create(rel(a,"is parent of",b))


#res = batch.submit()
#print res

print 'end'
Quazi Marufur Rahman
1 Answer


Batch

A batch will be much faster than creating single nodes. But if you run a batch, you should submit it every couple of hundred items. When the batch is too big, it gets slower. Try something like:

graph = neo4j.GraphDatabaseService()
batch = neo4j.WriteBatch(graph)

i = 0
results = []

for item in rels:
    a = item[0]
    b = item[1]
    batch.create(rel(a,"is parent of",b))
    i += 1

    # submit every 500 steps
    if i % 500 == 0:
        # collect results in list
        results.extend(batch.submit())
        # reinitialize and clear batch 
        batch = neo4j.WriteBatch(graph)

# submit last items         
results.extend(batch.submit())

Cypher transactions

A good alternative is Cypher transactions. For me, they run a bit faster, but you have to write Cypher queries. For the simple creation of items this is obviously more complicated than using py2neo nodes/rels. But it might come in handy for other operations (e.g. MERGE to update nodes). Keep in mind that you also have to .execute() the transaction regularly; if it gets too big, it slows down.

from py2neo import cypher

session = cypher.Session("http://localhost:7474")
tx = session.create_transaction()

# send three statements for execution but leave the transaction open
tx.append("MERGE (a:Person {name:'Alice'}) "
          "RETURN a")
tx.append("MERGE (b:Person {name:'Bob'}) "
          "RETURN b")
tx.append("MATCH (a:Person), (b:Person) "
          "WHERE a.name = 'Alice' AND b.name = 'Bob' "
          "CREATE UNIQUE (a)-[ab:KNOWS]->(b) "
          "RETURN ab")
tx.execute()
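
For the parent->child import from the question, the same periodic-execute idea could look roughly like the sketch below. It is only a sketch: it assumes rels holds (parent_name, child_name) string pairs rather than the node objects built above, and the :Category label, the IS_PARENT_OF relationship type and the chunk size of 1000 are placeholders you would adjust to your data.

from py2neo import cypher

session = cypher.Session("http://localhost:7474")
tx = session.create_transaction()

for i, (parent, child) in enumerate(rels, 1):
    # MERGE avoids duplicate nodes, CREATE UNIQUE avoids duplicate relationships
    tx.append("MERGE (p:Category {name: {p}}) "
              "MERGE (c:Category {name: {c}}) "
              "CREATE UNIQUE (p)-[:IS_PARENT_OF]->(c)",
              {"p": parent, "c": child})

    # execute pending statements every 1000 items but keep the transaction open
    if i % 1000 == 0:
        tx.execute()

# execute and commit whatever is left
tx.commit()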

With both transactions and batches I write millions of nodes/relationships in a few minutes. You have to try different batch/transaction sizes (e.g. from 100 to 5000); I think the optimal size depends on the amount of memory neo4j is using.
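
If you want to compare sizes without rewriting the loop each time, one option is to split the relationship list into chunks up front and rebuild the WriteBatch per chunk. A rough sketch (the chunks() helper and the size of 500 are only illustrative; graph and rels come from the code above):

def chunks(items, size):
    # yield successive slices of length `size`
    for start in range(0, len(items), size):
        yield items[start:start + size]

results = []
for chunk in chunks(rels, 500):
    batch = neo4j.WriteBatch(graph)
    for a, b in chunk:
        batch.create(rel(a, "is parent of", b))
    # one submit per chunk keeps each request small
    results.extend(batch.submit())

Timing one run per candidate size should show fairly quickly which range works best on your machine.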

Martin Preusse
  • I have already tested batch submit. batch.submit() doesn't clear the batch; you need to reinitialize the batch to free it after batch.submit(). This is not as fast as I expected. Maybe there is a memory-leak-type issue in both the batch and the normal create operation. This process consumes my 4 GB of RAM after creating around 0.4M relationships. I am trying to use batch-importer now. – Quazi Marufur Rahman May 20 '14 at 09:05
  • Ok, you can easily reinitialize the batch after submitting, see edit. There are different ways to iterate the list and submit/reinitialize every N steps. You could split the list beforehand and use chunks etc. The create operations are fine, I am doing this a lot and it works. Is Python consuming your memory or neo4j/Java? – Martin Preusse May 20 '14 at 09:13
  • py2neo, batch.create(), batch.submit() – Quazi Marufur Rahman May 20 '14 at 09:19
  • What's that supposed to mean :) ? – Martin Preusse May 20 '14 at 09:19
  • I didn't understand your last question :) – Quazi Marufur Rahman May 20 '14 at 09:22
  • You said your process is consuming your memory. I asked which process, i.e. whether Python is using your memory (pointing towards a problem in the Python code) or neo4j/Java is (pointing towards a huge batch or transaction that should be split into parts). – Martin Preusse May 20 '14 at 09:26