Efficient way to import multiple csv's in neo4j

Question

I am working on creating a graph database in neo4j for a CALL dataset. The dataset is stored in csv file with following columns: Source, Target, Timestamp, Duration. Here Source and Target are Person id's (numeric), Timestamp is datetime and duration is in seconds (integer).

I modeled my graph where person are nodes(person_id as property) and call as relationship (time and duration as property). There are around 2,00,000 nodes and around 70 million relationships. I have a separate csv files with person id's which I used to create the nodes. I also added uniqueness constraint on the Person id's.

CREATE CONSTRAINT ON ( person:Person ) ASSERT (person.pid) IS UNIQUE

I didn't completely understand the working of bulk import so I wrote a python script to split my csv into 70 csv's where each csv has 1 million nodes (saved as calls_0, calls_1, .... calls_69). I took the initiative to manually run a cypher query changing the filename every time. It worked well(fast enough) for first few(around 10) files but then I noticed that after adding relationship from a file, the import is getting slower for the next file. Now it is taking almost 25 minutes for importing a file. Can someone link me to an efficient and easy way of doing it?

Here is the cypher query:

:auto USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH toInteger(line.Source) AS Source, 
datetime(replace(line.Time,' ','T')) AS time,
toInteger(line.Target) AS Target,
toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
MERGE (p1)-[rel:CALLS {time: time, duration: Duration}]->(p2)
RETURN count(rel)

I am using Neo4j 4.0.3

Have you indexed the node properties used in the merge? CREATE INDEX [index_name] FOR (n:LabelName) ON (n.propertyName) — David A Stumpf, Jun 04 '20 at 08:43
I have added uniqueness constraint on the node property. According to the documentation, "Adding the unique constraint will implicitly add an index on that property" — kinger, Jun 04 '20 at 11:31
Understood. An index speeds LOAD CSV very significantly. But (I don't know the answer) does a constraint impose an extra burden on LOAD CSV. Preventing a duplicate (constraint) is different than finding whether it exists. It may take an experiment: remove the constraint and add only an index and then compare the results. — David A Stumpf, Jun 04 '20 at 14:58

cybersam · Accepted Answer · 2020-06-03T22:53:00.503

0

Your MERGE clause has to check for an existing matching relationship (to avoid creating duplicates). If you added a lot of relationships between Person nodes, that could make the MERGE clause slower.

You should consider whether it is safe for you to use CREATE instead of MERGE.

edited Jun 03 '20 at 22:53

answered Jun 03 '20 at 22:43

cybersam

63,203
6
53
76

In my case same nodes can have more than 1 relationship. Since I have uniqueness constraint on the node property, I need to use MERGE instead of CREATE. – kinger Jun 04 '20 at 11:35
Since `p1` and `p2` are already bound to existing nodes, `CREATE (p1)-[...]->(p2)` would never attempt to create new nodes and therefore cannot cause a node constraint violation. – cybersam Jun 04 '20 at 15:19
Thanks. It is much faster now. Also, I removed the toInteger Operation on node property as it was leading to cartesian join (as per the suggestions in https://community.neo4j.com/t/importing-relationships-from-multiple-csv-file/19489/8 ) – kinger Jun 04 '20 at 17:29

score 0 · Answer 2 · answered Jun 05 '20 at 19:21

Is much better if you export the match using the ID of each node and then create the relationship.

POC

CREATE INDEX ON :Person(`pid`);

CALL apoc.export.csv.query("LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH toInteger(line.Source) AS Source, 
datetime(replace(line.Time,' ','T')) AS time,
toInteger(line.Target) AS Target,
toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
RETURN ID(a) AS ida,ID(b) as idb,time,Duration","rels.csv", {});

and then

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:////rels.csv' AS row
MATCH (a:Person) WHERE ID(a) = toInt(row.ida)
MATCH (b:Person) WHERE ID(b) = toInt(row.idb)
MERGE (b)-[:CALLS {time: row.time, duration: Duration}]->(a);

For me this is the best way to do this.

Efficient way to import multiple csv's in neo4j

2 Answers2