
When importing data from large CSV files (>200 MB) into Neo4j, the response ends up hanging. The query does complete and all records are imported, but there seems to be some sort of response timeout, so there is no indication that the import query has finished. This is a problem because we cannot automate importing multiple files into Neo4j: the script keeps waiting for the query to finish, even though it already has.

Importing 1 file takes around 10-15 minutes.

No errors are thrown anywhere in the pipeline; everything simply hangs. I can only tell that the process has completed when the VM's CPU activity dies down.

This process does work on smaller files: it sends back an acknowledgement when the previous file has finished importing and moves on to the next one.

I have tried running the script from both a Jupyter notebook and a Python script directly on the console. I have also tried running the query directly on Neo4j through the browser console. Each way results in a hanging query, so I am not sure whether the issue is coming from Neo4j or Py2Neo.

Example query:

USING PERIODIC COMMIT 1000
LOAD CSV FROM {csvfile}  AS line
MERGE (:Author { authorid: line[0], name: line[1] } )

Modified Python script using Py2Neo:

from azure.storage.blob import BlockBlobService
from py2neo import Graph

# <host> and <password> are placeholders; the connection details are not shown in the original post
mygraph = Graph("bolt://<host>:7687", auth=("neo4j", "<password>"))
query = """USING PERIODIC COMMIT 1000
LOAD CSV FROM {csvfile} AS line
MERGE (:Author { authorid: line[0], name: line[1] } )"""

blob_service = BlockBlobService(account_name="<name>", account_key="<key>")
generator = blob_service.list_blobs("parsed-csv-files")

for blob in generator:
    print(blob.name)
    csv_file_base = "http://<base_uri>/parsed-csv-files/"
    csvfile = csv_file_base + blob.name
    params = {"csvfile": csvfile}
    # hangs here for large files, even after the import has completed on the server
    mygraph.run(query, parameters=params)

Neo4j debug.log does not seem to be recording any errors.

Sample debug.log:

2019-05-30 05:44:32.022+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job finished: descriptor=IndexRule[id=16, descriptor=Index( UNIQUE, :label[5](property[5]) ), provider={key=native-btree, version=1.0}, owner=42], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/16/index-16 Number of pages visited: 598507, Number of cleaned crashed pointers: 0, Time spent: 2m 25s 235ms
2019-05-30 05:44:32.071+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job closed: descriptor=IndexRule[id=16, descriptor=Index( UNIQUE, :label[5](property[5]) ), provider={key=native-btree, version=1.0}, owner=42], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/16/index-16
2019-05-30 05:44:32.071+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job started: descriptor=IndexRule[id=19, descriptor=Index( UNIQUE, :label[6](property[6]) ), provider={key=native-btree, version=1.0}, owner=46], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/19/index-19
2019-05-30 05:44:57.126+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job finished: descriptor=IndexRule[id=19, descriptor=Index( UNIQUE, :label[6](property[6]) ), provider={key=native-btree, version=1.0}, owner=46], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/19/index-19 Number of pages visited: 96042, Number of cleaned crashed pointers: 0, Time spent: 25s 55ms
2019-05-30 05:44:57.127+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job closed: descriptor=IndexRule[id=19, descriptor=Index( UNIQUE, :label[6](property[6]) ), provider={key=native-btree, version=1.0}, owner=46], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/19/index-19

EDIT: used a simpler query, which still gives the same issue

  • Neo4j hangs when it uses the full memory allocated to it. **You can increase the max heap memory in `neo4j.conf` and restart Neo4j.** – Rajendra Kadam May 31 '19 at 10:23 (a sketch of the relevant settings follows these comments)
  • Also create indexes on `:Paper(paperid)` and `:Keyword(name)` to speed up queries. – Rajendra Kadam May 31 '19 at 10:25
  • Creating nodes and relationships all in one query, as you are doing, is not recommended. You can split your query into 2 or 3 queries, loading nodes and relationships separately. – Rajendra Kadam May 31 '19 at 10:28
  • Hi Raj, thanks for the reply. We have tried increasing max heap memory and will try again if need be. However, the query DOES finish importing all the records; it's just the response that seems to give an issue. If, after the query has finished, I manually stop the Python script and run it again with the next file, Neo4j starts running the new query once again. – Andrew Cachia May 31 '19 at 10:31
  • Indexes have already been created, sorry for not clarifying. Also, thanks for the pointer, however the above is just a sample query, the same thing happens even with other queries that simply create new records, so long as the file is large. – Andrew Cachia May 31 '19 at 10:33
  • Please check this, it may help you: https://community.neo4j.com/t/csv-import-hanging-server/4219 – Rajendra Kadam May 31 '19 at 10:54
  • Can you try with only one `MERGE`, and no other merge or create, in your query above? – Rajendra Kadam May 31 '19 at 11:02
  • Periodic commit doesn't work in some cases. – Rajendra Kadam May 31 '19 at 11:03
  • The link you sent unfortunately doesn't help, for the following reasons: 1) The issue there is that the DB doesn't enter any records; in our case all records are entered. 2) In the link, all subsequent queries cannot run as they are waiting for the first to finish. In our case, new queries can run, but they need to be started manually, and Py2Neo doesn't recognise when the previous one has finished. – Andrew Cachia May 31 '19 at 11:17
  • @AndrewCachia how did you resolve this? – mns Oct 28 '20 at 01:38
  • @mns we hadn't, ended up having to use smaller datasets :/ – Andrew Cachia Oct 29 '20 at 06:00
  • @AndrewCachia appreciate you responded, thanks :) We resorted to something similar too. – mns Oct 29 '20 at 07:29

1 Answer


Since the query takes a long time to complete on the DB side, maybe py2neo is having issues waiting for it.

There should not be any issues with periodic commit.

Have you tried the Python neo4j driver, reading the CSV from Python and executing the query that way?

Here's sample code using the neo4j driver.

import pandas as pd
from neo4j import GraphDatabase

# serveruri, user, pwd and config hold your own connection and configuration values
driver = GraphDatabase.driver(serveruri, auth=(user, pwd))
with driver.session() as session:
    file = config['spins_file']
    # read the CSV in manageable chunks instead of loading it all at once
    row_chunks = pd.read_csv(file, sep=',', error_bad_lines=False,
                             index_col=False,
                             low_memory=False,
                             chunksize=config['chunk_size'])
    for i, rows in enumerate(row_chunks):
        print("Chunk {}".format(i))
        # send each chunk as a list of dicts in the query parameters
        rows_dict = {'rows': rows.fillna(value="").to_dict('records')}
        # assumes the CSV has header columns named authorid and name
        session.run("""
                    UNWIND $rows AS row
                    MERGE (:Author { authorid: row.authorid, name: row.name })
                    """,
                    parameters=rows_dict)
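
One caveat with the code above: the question's LOAD CSV query indexes `line[0]` and `line[1]`, which suggests the CSV has no header row. In that case pandas has to be told the column names explicitly (the names here are assumptions), for example:

row_chunks = pd.read_csv(file, sep=',', header=None, names=['authorid', 'name'],
                         index_col=False, chunksize=config['chunk_size'])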