1

I am trying to load a CSV file into my Neo4j database hosted on GrapheneDB. It worked fine on the first file with 5000 rows. It took about 16 seconds to complete this file.

I am now importing the second file with the same schema and same amount of rows. The data is just different. The Cypher query has been running over 30 minutes and its still not complete. I am not sure what its doing and why its so slow. Here is my cypher:

USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'http://example.com/some.csv' AS line
Match (c:Customer {customerID: line.CustomerID}) 
MERGE (c)<-[r:DEPT_OF]-(dept:Dept { name: line.Category})
ON CREATE 
SET dept.name = line.Category, dept.deptID=line.DeptID, dept.createdDTS=1453478149463
MERGE (dept)<-[r1:IN_DEPT]-(pt:ProductType {name: dept.name}) 
ON CREATE 
SET pt.name = dept.name, pt.packQty = line.PackQty, pt.createdDTS = 1453478149463, 
pt.productTypeID = line.ProductTypeID
MERGE (pt)<-[r2:OF_TYPE]-(st:Style {name: line.Style})
ON CREATE 
SET st.name = line.Style, st.styleID = line.StyleID, st.styleNum = line.StyleNo, st.price = line.Price
MERGE (st)<-[r3:OF_STYLE]-(p:Product {productNum: line.UPC})
ON CREATE 
SET p.floorMin = line.MinFloor, p.floorMax = line.FloorMax, p.color = line.Color, p.createdDTS = 1453478149463,
p.size = line.Size, p.productID = line.ProductID;

For rows from my csv:

UPC,Category,Style,StyleNo,Color,Size,MinFloor,MaxFloor,Price,ProductID,CustomerID,ProductType,PackQty,DeptID,StyleID,ProductTypeID,ProductID
33383605005,FRESH VEGETABLES,GREEN ONIONS 24/10 OZ,NA,NA,NA,0,0,1.79,,5f795a69-47cb-49c8-a334-0cf5d67be423,FRESH VEGETABLES,1,538a02c6-b6b7-4d0d-8dca-5ff3a513d59e,3e08dabb-415a-4826-86e8-44efb9813892,cc6a0f3c-1c05-44a0-b603-37cbbb60954e,3324b2b1-954a-4547-a82d-553be66d7b54
52867010005,FRESH VEGETABLES,GREEN ONIONS 24/10 OZ,NA,NA,NA,0,0,1.79,,5f795a69-47cb-49c8-a334-0cf5d67be423,FRESH VEGETABLES,1,edfa998f-3749-4d1f-bd96-3f4a5db0de67,fb11a8e5-de49-44da-924a-9ebc5f7f01d2,d47fd5d8-dbf0-4110-b701-543e6ed0ae40,28a9d206-96c6-4fe4-b528-84e446ba3c16

Update 1:

I have added the following indexes based off the response from Nicole.

  1. CREATE INDEX ON :Customer(customerID)
  2. CREATE INDEX ON :Dept(name)
  3. CREATE INDEX ON :ProductType(name)
  4. CREATE INDEX ON :Style(name)
  5. CREATE INDEX ON :Product(productNum)

This helped a lot, but still takes about 20 seconds for 5K rows. Is that normal?

Any help is appreciated.

Update 2:

Based off the response from @michael, I researched a little further and found the following article very useful:

http://graphaware.com/neo4j/2014/07/31/cypher-merge-explained.html

Update 3:

I have updated my cypher to the following to avoid duplicates. I hope this looks right?

USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM '...' AS line
Match (c:Customer {customerID: line.CustomerID}) 
MERGE (c)<-[r:DEPT_OF]-(dept:Dept { name: line.Category })
ON CREATE 
SET dept.name = dept.name, dept.deptID=line.categoryID, dept.createdDTS=1453742532269, dept.modifiedDTS = 1453742532269
MERGE (c)<-[r22:DEPT_OF]-(dept)
MERGE (dept)<-[r1:IN_DEPT]-(pt:ProductType {name: dept.name}) 
ON CREATE 
SET pt.name = dept.name, pt.packQty = line.PackQty, pt.createdDTS = 1453742532269, pt.productTypeID = line.ProductTypeID, pt.modifiedDTS = 1453742532269
MERGE (c)<-[r2:DEPT_OF]-(dept)
MERGE (dept)<-[r3:IN_DEPT]-(pt) 
MERGE (pt)<-[r4:OF_TYPE]-(st:Style {name: line.Style})
ON CREATE 
SET st.name = line.Style, st.styleID = line.StyleID, st.styleNum = line.StyleNo, st.price = line.Price, st.modifiedDTS = 1453742532269, st.createdDTS = 1453742532269
MERGE (c)<-[r5:DEPT_OF]-(dept)
MERGE (dept)<-[r6:IN_DEPT]-(pt) 
MERGE (pt)<-[r7:OF_TYPE]-(st)
MERGE (st)<-[r8:OF_STYLE]-(p:Product {productNum: line.UPC})
ON CREATE 
SET p.floorMin = line.MinFloor, p.floorMax = line.FloorMax, p.color = line.Color, p.createdDTS = 1453742532269,p.modifiedDTS = 1453742532269, p.size = line.Size, p.productID = line.ProductID;
Sonu Kapoor
  • 1,567
  • 3
  • 16
  • 34
  • Do you have indexes on anything? – Nicole White Jan 22 '16 at 18:14
  • No. How can I add the index? – Sonu Kapoor Jan 22 '16 at 18:15
  • I am not sure on which nodes I should add the indexes. Can you suggest some based of the cypher? – Sonu Kapoor Jan 22 '16 at 18:21
  • Hello @skone, I'm the CEO of GrapheneDB. Besides the obvious (indexes, # of runs per commit), the performance will depend on the GrapheneDB plan you are using. Which plan are you currently using? BTW, feel free to contact our support team if you experience any issues. – albertoperdomo Jan 22 '16 at 18:53
  • Why do you have Depts with different IDs but the same name? Do you want those to be the same node? – Nicole White Jan 22 '16 at 18:56
  • @NicoleWhite The script should create the first dept with the UUID on the same row. If the same dept is found again by using the match, then the on create will not called and therefore the UUID will actually not be used. I was to lazy to create a VLOOKUP. Hope that makes sense. – Sonu Kapoor Jan 22 '16 at 21:11
  • @albertoperdomo ticket opened. Can you have somebody review this asap please? – Sonu Kapoor Jan 22 '16 at 21:13
  • which version of Neo4j do you use? – Michael Hunger Jan 23 '16 at 00:08
  • you understand that a merge operation like this will create the ProductType node again as soon as it doesn't find the full pattern? `MERGE (dept)<-[r1:IN_DEPT]-(pt:ProductType {name: dept.name})` – Michael Hunger Jan 23 '16 at 00:10
  • Usually you'd do: `MERGE (pt:ProductType {name: dept.name}) MERGE (dept)<-[r1:IN_DEPT]-(pt)` – Michael Hunger Jan 23 '16 at 00:10
  • Also I would not use indexes but unique constraints: `CREATE CONSTRAINT ON (c:Customer) ASSERT c.customerID IS UNIQUE` – Michael Hunger Jan 23 '16 at 00:11
  • @MichaelHunger - I just notice the duplicates, thats a boomer. I did not know that. I come from the SQL world, so I was trying to do an If Not Exists. I guess that works differently in node when using a merge. Is the rest of my cypher correct? – Sonu Kapoor Jan 25 '16 at 15:32
  • @MichaelHunger, do you suggest I create constraints for all nodes? The other constraints are customer specific - for example the same department name can exist for different customers. – Sonu Kapoor Jan 26 '16 at 15:36
  • Possible duplicate of [Neo4j CSV import being too slow](https://stackoverflow.com/questions/41792274/neo4j-csv-import-being-too-slow) – c z Nov 08 '17 at 16:02
  • You could also make it local so you can use other methods. Then upload it to graphene db? – Mvde Jan 25 '16 at 11:07
  • If you are the first time import data (the amount will be very large), use create instead of merge. the speed will dramatically increase. – Henry Sou Feb 22 '19 at 03:58

0 Answers0