How to avoid duplicate while creating node using apoc.periodic.iterate for csv load neo4j?

Question

I have to create node from CSV file. As the file is too large I am using apoc.periodic.iterate function to read file. CSV is having duplicate entries, and due to parallel processing duplicate nodes are created. How can I make sure that the duplicate nodes are not created even with parallel processing.

CALL apoc.periodic.iterate('LOAD CSV WITH HEADERS FROM "file:///DATA.csv" AS payload return payload',
'MERGE (l:PERSON {name :payload.name})
 ON CREATE SET 
              l.pid = payload.id,             
              l.createdDate= timestamp(),
              l.lastModifiedDate= timestamp()             
 ON MATCH SET 
              l.lastModifiedDate= timestamp()',
    {batchSize:500, parallel:true, concurrency: 4});

InverseFalcon · Answer 1 · 2021-03-22T12:23:48.303

With parallel execution, you need a unique constraint on :PERSON(name), as unique constraints have schema locks that prevent duplicate creation in this situation.

If names aren't unique, then either disable parallel import or clean your data so there are no duplicates.

You could also lock on a singular node before the MERGE to guarantee mutual exclusion, but that has the same affect as disabling parallel execution, so toggling parallel to false is the better option.

EDIT

So the problem you're running into here is that your MERGE isn't backed by a unique constraint. In order to get this, both the property and the label for the constraint must be present in the pattern, it won't work if you only have them in the ON CREATE or ON MATCH sections.

So if your constraint is on :PERSON(pid), then your pattern should be:

MERGE (l:PERSON {pid:payload.id})

and then you can set the name in the ON CREATE section.

If names are supposed to be unique, then you can get the same effect by creating a unique constraint on :PERSON(name)

While you can use EXPLAIN on the query to get the query plan, since this is a query string, you will need to do some cut and pasting to see the plan for the query in question, as well as make a small addition at the top so the query can compile. You can try this:

EXPLAIN
WITH $payload as payload
MERGE (l:PERSON {name :payload.name})
 ON CREATE SET 
              l.pid = payload.id,             
              l.createdDate= timestamp(),
              l.lastModifiedDate= timestamp()             
 ON MATCH SET 
              l.lastModifiedDate= timestamp()

You want to see a NodeUniqueIndexSeek(Locking) operator in the plan in order for the MERGE to work correctly and use correct locking to prevent duplicates.

If you see NodeIndexSeek then only an index is being used, it won't prevent duplicates as there's nothing to lock on to guarantee mutual exclusion. You need that unique constraint.

And if NodeByLabelScan is there instead, that's even worse, as you don't have a unique constraint or an index backing your MERGE. If execution time was a problem for you earlier, it's probably because it was doing a label scan, which won't perform well at all for loading.

I have a constraint on pid, due to which if the duplicate entry is there in CSV and the script tries to create a node, it fails and the complete batch is skipped. — Piyush Srivastava, Mar 22 '21 at 11:13
if we toggle parallel to false then the execution time for complete csv will increase. CSV is having approx 100K records. — Piyush Srivastava, Mar 22 '21 at 11:16
Then you should MERGE on the pid, not the name (so swap them so the pid is in the pattern, and the name is set on creation), that should get you the behavior you want. Also processing 100k records should only take a matter of seconds at most, just need to make sure you have an index (or unique constraint) to back that MERGE. — InverseFalcon, Mar 22 '21 at 11:31
Name is also unique for all the node, the issue is not with the constraint. Scenerio is like if there is a duplicate entry in CSV and node does not exits in DB. Now both duplicate entries are picked by a different batch, so even after using merge both the batch will try to create the node and conflict will raise. — Piyush Srivastava, Mar 22 '21 at 11:58
A MERGE backed by a unique constraint will work fine. If it's not backed by a unique constraint (and your constraint on :PERSON(pid) won't help if `pid` isn't in the pattern) then you would get the behavior you're currently facing. You can use EXPLAIN on a modified version of the updating query to see its plan (prefix it with `WITH $payload as payload` to get it to compile), you want to see `NodeUniqueIndexSeek(Locking)` in the query plan. — InverseFalcon, Mar 22 '21 at 12:10
I added on with more context about how to get the query plan and what to look for. — InverseFalcon, Mar 22 '21 at 12:24

How to avoid duplicate while creating node using apoc.periodic.iterate for csv load neo4j?

1 Answers1