1

I am working on creating a graph database in neo4j for a CALL dataset. The dataset is stored in csv file with following columns: Source, Target, Timestamp, Duration. Here Source and Target are Person id's (numeric), Timestamp is datetime and duration is in seconds (integer).

I modeled my graph where person are nodes(person_id as property) and call as relationship (time and duration as property). There are around 2,00,000 nodes and around 70 million relationships. I have a separate csv files with person id's which I used to create the nodes. I also added uniqueness constraint on the Person id's.

CREATE CONSTRAINT ON ( person:Person ) ASSERT (person.pid) IS UNIQUE

I didn't completely understand the working of bulk import so I wrote a python script to split my csv into 70 csv's where each csv has 1 million nodes (saved as calls_0, calls_1, .... calls_69). I took the initiative to manually run a cypher query changing the filename every time. It worked well(fast enough) for first few(around 10) files but then I noticed that after adding relationship from a file, the import is getting slower for the next file. Now it is taking almost 25 minutes for importing a file. Can someone link me to an efficient and easy way of doing it?

Here is the cypher query:

:auto USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH toInteger(line.Source) AS Source, 
datetime(replace(line.Time,' ','T')) AS time,
toInteger(line.Target) AS Target,
toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
MERGE (p1)-[rel:CALLS {time: time, duration: Duration}]->(p2)
RETURN count(rel)

I am using Neo4j 4.0.3

Tomerikoo
  • 18,379
  • 16
  • 47
  • 61
kinger
  • 33
  • 7
  • Have you indexed the node properties used in the merge? CREATE INDEX [index_name] FOR (n:LabelName) ON (n.propertyName) – David A Stumpf Jun 04 '20 at 08:43
  • I have added uniqueness constraint on the node property. According to the documentation, "Adding the unique constraint will implicitly add an index on that property" – kinger Jun 04 '20 at 11:31
  • Understood. An index speeds LOAD CSV very significantly. But (I don't know the answer) does a constraint impose an extra burden on LOAD CSV. Preventing a duplicate (constraint) is different than finding whether it exists. It may take an experiment: remove the constraint and add only an index and then compare the results. – David A Stumpf Jun 04 '20 at 14:58

2 Answers2

0

Your MERGE clause has to check for an existing matching relationship (to avoid creating duplicates). If you added a lot of relationships between Person nodes, that could make the MERGE clause slower.

You should consider whether it is safe for you to use CREATE instead of MERGE.

cybersam
  • 63,203
  • 6
  • 53
  • 76
  • In my case same nodes can have more than 1 relationship. Since I have uniqueness constraint on the node property, I need to use MERGE instead of CREATE. – kinger Jun 04 '20 at 11:35
  • Since `p1` and `p2` are already bound to existing nodes, `CREATE (p1)-[...]->(p2)` would never attempt to create new nodes and therefore cannot cause a node constraint violation. – cybersam Jun 04 '20 at 15:19
  • Thanks. It is much faster now. Also, I removed the toInteger Operation on node property as it was leading to cartesian join (as per the suggestions in https://community.neo4j.com/t/importing-relationships-from-multiple-csv-file/19489/8 ) – kinger Jun 04 '20 at 17:29
0

Is much better if you export the match using the ID of each node and then create the relationship.

POC

CREATE INDEX ON :Person(`pid`);

CALL apoc.export.csv.query("LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH toInteger(line.Source) AS Source, 
datetime(replace(line.Time,' ','T')) AS time,
toInteger(line.Target) AS Target,
toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
RETURN ID(a) AS ida,ID(b) as idb,time,Duration","rels.csv", {});

and then

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:////rels.csv' AS row
MATCH (a:Person) WHERE ID(a) = toInt(row.ida)
MATCH (b:Person) WHERE ID(b) = toInt(row.idb)
MERGE (b)-[:CALLS {time: row.time, duration: Duration}]->(a);

For me this is the best way to do this.

nguaman
  • 925
  • 1
  • 9
  • 23