Hello: I am posting this rather extensive answer as I recently encountered surprising difficulty dealing with NULL (missing) values present in my CSV files, when attempting to load those data into Neo4j (neo4j 3.3.4).
I present three solutions.
I am using the Cycli (cycli 0.7.6) CLI, installed via pip in a Python 3.5 venv on an Arch Linux x86_64 system.
My CSV file (glycolysis_metabolites.csv) is:
name,abbreviation,kegg_entry
α-D-glucose,GLC,C00267
glucose 6-phosphate,G6P,C00668
fructose 6-phosphate,F6P,C05345
"fructose 1,6-bisphosphate",FBP,C05378
dihydroxyacetone phosphate,DHAP,C00111
D-glyceraldehyde 3-phosphate,,C00118
"1,3-bisphosphoglycerate","1,3-BPG",C00236
3-phosphoglycerate,3PG,C00197
2-phosphoglycerate,2PG,C00631
phosphoenolpyruvate,PEP,C00074
pyruvate,,C00022
Those data, copied from a PostgreSQL table via the psql /COPY ... command, have a "UNIQUE NOT NULL" constraint on the "name" field.
After surveying Google etc. I conducted three Experiments, below. Experiments 2 and 3 are basically the same.
I believe that the approach shown in Experiment 2 is the best solution, as the COALESCE statements are included within the MERGE statement.
My reason for this conclusion is that Experiment 2 uses "local" variables, rather than returning "global" variables (Experiment 3), thus minimizing unintended consequences on reused variable names.
I load my Cypher script as follows:
cat glycolysis_script.cypher | cypher-shell -u victoria -p <your_password>
** EXPERIMENT 1**
Reference: http://markhneedham.com/blog/2014/08/22/neo4j-load-csv-handling-empty-columns/
This solution (Mark Needham's) is pretty clever: it creates nodes containing all non-NULL properties, e.g.
<id>: 0 abbreviation: GLC kegg_entry: C00267 name: α-D-glucose
<id>: 10 kegg_entry: C00022 name: pyruvate
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/pg2neo4j/glycolysis_metabolites.csv" AS row
MERGE (a:GlycolysisMetabolites {name: row.name})
FOREACH(ignoreMe IN CASE WHEN row.abbreviation <> "" THEN [1] ELSE [] END | SET a.abbreviation = row.abbreviation)
FOREACH(ignoreMe IN CASE WHEN row.kegg_entry <> "" THEN [1] ELSE [] END | SET a.kegg_entry = row.kegg_entry)
// With "USING PERIODIC COMMIT",
// RETURN a;
// throws this error: "Unknown value type: STRUCT"
// ... so, use this:
RETURN a.name, a.abbreviation, a.kegg_entry;
OUTPUT:
$ cat glycolysis.cypher | cypher-shell -u victoria -p <your_password>
a.name, a.abbreviation, a.kegg_entry
"α-D-glucose", "GLC", "C00267"
"glucose 6-phosphate", "G6P", "C00668"
"fructose 6-phosphate", "F6P", "C05345"
"fructose 1,6-bisphosphate", "FBP", "C05378"
"dihydroxyacetone phosphate", "DHAP", "C00111"
"D-glyceraldehyde 3-phosphate", NULL, "C00118"
"1,3-bisphosphoglycerate", "1,3-BPG", "C00236"
"3-phosphoglycerate", "3PG", "C00197"
"2-phosphoglycerate", "2PG", "C00631"
"phosphoenolpyruvate", "PEP", "C00074"
"pyruvate", NULL, "C00022"
However, you cannot set your own MERGE specifications, on properties that include NULL values (here: "abbreviation") -- the reason being that you cannot MERGE on NULL property values.
Works:
MERGE (a:GlycolysisMetabolites {name: row.name})
Fails ("Cannot merge node using null property value for abbreviation"):
MERGE (a:GlycolysisMetabolites {name: row.name, abbreviation:row.abbreviation})
MERGE (a:GlycolysisMetabolites {name: row.name, abbreviation:row.abbreviation, kegg_entry:row.kegg_entry})
EXPERIMENT 2
Reference: Neo4j use MERGE with null values
Here, I set an empty string ('') as a replacement for the NULL values present in the CSV file; you can use whatever you want; e.g.: 'Undefined', 'null', ...
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/pg2neo4j/glycolysis_metabolites.csv" AS row
// MERGE (a:GlycolysisMetabolites {name: row.name})
MERGE (a:GlycolysisMetabolites {name: row.name, abbreviation:COALESCE(row.abbreviation, ''), kegg_entry:COALESCE(row.kegg_entry, '')})
// With "USING PERIODIC COMMIT",
// RETURN a;
// throws this error: "Unknown value type: STRUCT"
// ... so, use this:
RETURN a.name, a.abbreviation, a.kegg_entry;
OUTPUT:
$ cat glycolysis.cypher | cypher-shell -u victoria -p <your_password>
a.name, a.abbreviation, a.kegg_entry
"α-D-glucose", "GLC", "C00267"
"glucose 6-phosphate", "G6P", "C00668"
"fructose 6-phosphate", "F6P", "C05345"
"fructose 1,6-bisphosphate", "FBP", "C05378"
"dihydroxyacetone phosphate", "DHAP", "C00111"
"D-glyceraldehyde 3-phosphate", "", "C00118"
"1,3-bisphosphoglycerate", "1,3-BPG", "C00236"
"3-phosphoglycerate", "3PG", "C00197"
"2-phosphoglycerate", "2PG", "C00631"
"phosphoenolpyruvate", "PEP", "C00074"
"pyruvate", "", "C00022"
EXPERIMENT 3
References:
Neo4j use MERGE with null values
https://github.com/neo4j/neo4j/issues/2521
This also works, but as the COALESCE statements are outside the MERGE statement, I worry that the data returned by the RETURN statement could cause problems, if those variable names are reused elsewhere. As a workaround,
I added a prefix (a_) as a quasi-UID, but I think the solution in Experiment 2, above, is the better approach.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/pg2neo4j/glycolysis_metabolites.csv" AS row
WITH
COALESCE(CASE row.name WHEN '' THEN null ELSE row.name END, '') AS a_name,
COALESCE(CASE row.abbreviation WHEN '' THEN null ELSE row.abbreviation END, '') AS a_abbreviation,
COALESCE(CASE row.kegg_entry WHEN '' THEN null ELSE row.kegg_entry END, '') AS a_kegg_entry
MERGE (a:GlycolysisMetabolites {name:a_name, abbreviation:a_abbreviation, kegg_entry:a_kegg_entry})
// Note: RETURN can only be used at the end of the query
RETURN a_name, a_abbreviation, a_kegg_entry;
OUTPUT:
$ cat glycolysis.cypher | cypher-shell -u victoria -p <your_password>
a_name, a_abbreviation, a_kegg_entry
"α-D-glucose", "GLC", "C00267"
"glucose 6-phosphate", "G6P", "C00668"
"fructose 6-phosphate", "F6P", "C05345"
"fructose 1,6-bisphosphate", "FBP", "C05378"
"dihydroxyacetone phosphate", "DHAP", "C00111"
"D-glyceraldehyde 3-phosphate", "", "C00118"
"1,3-bisphosphoglycerate", "1,3-BPG", "C00236"
"3-phosphoglycerate", "3PG", "C00197"
"2-phosphoglycerate", "2PG", "C00631"
"phosphoenolpyruvate", "PEP", "C00074"
"pyruvate", "", "C00022"
Additional StackOverflow discussion on this topic/issue:
https://stackoverflow.com/search?tab=votes&q=Neo4j%20use%20MERGE%20with%20null%20value
Addendum
Reference (e.g.): Neo4j CSV file load with empty cells
This "works", but SKIPS the creation of a node if any of the fields contains NULL values:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/pg2neo4j/glycolysis_metabolites.csv" AS row
FOREACH (
x IN CASE WHEN row.abbreviation IS NULL OR row.kegg_entry IS NULL THEN [] ELSE [1] END |
MERGE (a:GlycolysisMetabolites {name: row.name, abbreviation: row.abbreviation, kegg_entry: row.kegg_entry})
)
RETURN row.name, row.abbreviation, row.kegg_entry;
OUTPUT:
$ cat glycolysis.cypher | cypher-shell -u victoria -p <password>
row.name, row.abbreviation, row.kegg_entry
"α-D-glucose", "GLC", "C00267"
"glucose 6-phosphate", "G6P", "C00668"
"fructose 6-phosphate", "F6P", "C05345"
"fructose 1,6-bisphosphate", "FBP", "C05378"
"dihydroxyacetone phosphate", "DHAP", "C00111"
"D-glyceraldehyde 3-phosphate", NULL, "C00118"
"1,3-bisphosphoglycerate", "1,3-BPG", "C00236"
"3-phosphoglycerate", "3PG", "C00197"
"2-phosphoglycerate", "2PG", "C00631"
"phosphoenolpyruvate", "PEP", "C00074"
"pyruvate", NULL, "C00022"
Note that in the Neo4j Browser, only 9 (not 11) nodes are created: nodes for "D-glyceraldehyde 3-phosphate" and "pyruvate" are not created.