
I'm loading relationships into my Neo4j graph database using the LOAD CSV operation. The nodes are already created. I have four different types of relationships to create from four different CSV files (file 1: 59 relationships, file 2: 905 relationships, file 3: 173,000 relationships, file 4: over 1 million relationships). The Cypher queries execute just fine; however, file 1 (59 relationships) takes 25 seconds to execute, file 2 took 6.98 minutes, and file 3 has been running for the past 2 hours. I'm not sure these execution times are normal given Neo4j's capability to handle millions of relationships. A sample Cypher query I'm using is given below.

LOAD CSV WITH HEADERS FROM "file:/sample.csv" AS rels3
MATCH (a:Index1 {Filename: rels3.Filename})
MATCH (b:Index2 {Field_name: rels3.Field_name})
CREATE (a)-[:relation1 {type: rels3.`relation1`}]->(b)
RETURN a, b

'a' and 'b' are two indexes I created for two of the preloaded node categories, hoping to speed up the lookup operations.

Additional information: number of nodes (a category): 1,791; number of nodes (b category): 3,341.

Is there a faster way to load this, and does the LOAD CSV operation normally take this long? Am I going wrong somewhere?

MohanVS

2 Answers


Create an index on Index1.Filename and Index2.Field_name:

CREATE INDEX ON :Index1(Filename);
CREATE INDEX ON :Index2(Field_name);

Verify these indexes are online:

:schema

Verify your query is using the indexes by adding PROFILE to the start of the query and inspecting the execution plan.
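For example, with the sample query from the question (a sketch; run it in the Neo4j browser or shell and check the plan it prints):

```cypher
// Prefix the query with PROFILE to see the actual execution plan.
// You want to see NodeIndexSeek operators for the two MATCH clauses;
// a NodeByLabelScan or AllNodesScan means an index is NOT being used.
PROFILE
LOAD CSV WITH HEADERS FROM "file:/sample.csv" AS rels3
MATCH (a:Index1 {Filename: rels3.Filename})
MATCH (b:Index2 {Field_name: rels3.Field_name})
CREATE (a)-[:relation1 {type: rels3.`relation1`}]->(b)
RETURN a, b
```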

More info here

William Lyon
  • Hello William, I had created the Indexes and they are online. I will now verify whether the indexes are being used or not. – MohanVS Jul 06 '16 at 19:33
  • Yes William, looks like the index was not being used as it was scanning all nodes in the DB. – MohanVS Jul 06 '16 at 20:23

What I like to do before running a query is run EXPLAIN first to see if there are any warnings; I have fixed many a query thanks to the warnings. (Simply prepend EXPLAIN to your query.)

Also, perhaps you can drop the RETURN clause. After your query finishes, you can run a separate query to view the nodes.
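Applied to the query from the question, that would look like the sketch below. (A query that ends in a CREATE clause does not require a RETURN; only a query ending in a bare MATCH does.)

```cypher
// Same load query, minus the RETURN clause: Cypher no longer has to
// stream every matched pair (a, b) back to the client, which can save
// both memory and browser rendering time on large imports.
LOAD CSV WITH HEADERS FROM "file:/sample.csv" AS rels3
MATCH (a:Index1 {Filename: rels3.Filename})
MATCH (b:Index2 {Field_name: rels3.Field_name})
CREATE (a)-[:relation1 {type: rels3.`relation1`}]->(b)
```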

I create roughly 20M relationships in about 54 minutes using a query very similar to yours.

Indexes are important because that's how Neo4j finds the nodes.

Albert S
  • Hello Albert! What kind of hardware do you use to build 20 million relationships in under an hour? I'm trying to run a query with over 100,000 and it hangs after some time. I'm running Neo4j on a desktop with 16 GB of RAM and an i5 quad-core processor (3.2 GHz). – MohanVS Jul 25 '16 at 14:22
  • 1
  • It is an Intel Xeon E3-1220 4-core @ 3.1 GHz, 32 GB of RAM, 1 TB SSD. I noticed you have a RETURN statement in your load query; it isn't really necessary. See if removing it speeds anything up. I am not sure how Neo4j handles it, but it could keep the nodes it needs to return in memory, thereby eating your memory fast – but this is just conjecture. You can always view the data after it is loaded. – Albert S Jul 25 '16 at 18:31
  • I discovered that the queries actually get executed, but the browser hangs. When I kill the page and reload it, I can see that all the nodes/relationships were created successfully. And then when I write a query as simple as "match (n) return n limit 100,000" the browser hangs again. The query works for a limit up to 10,000 or so, but once it crosses a certain limit the browser hangs. – MohanVS Jul 26 '16 at 19:20
  • I tried removing the return clause, but the console returns an error saying a match must have a return – MohanVS Jul 26 '16 at 19:24
  • I use the neo4j-shell to run LOAD CSV queries in a screen session (I have run queries that take over 12 hrs to complete). I'm on Linux. Can you use the shell? Then you can use the browser to count your loaded nodes/relationships. – Albert S Jul 28 '16 at 20:20
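The workflow from the comment above might look roughly like this. This is only a sketch: the file name `load_rels.cql` and the session name `import` are illustrative, and the exact neo4j-shell invocation depends on your Neo4j version and install path.

```shell
# Start a detachable screen session so a long-running import survives
# a closed terminal or a hung browser tab.
screen -S import

# Inside the session, run the LOAD CSV query saved in a .cql file.
neo4j-shell -file load_rels.cql

# Detach with Ctrl-a d; reattach later with:
#   screen -r import
# Once the import finishes, count the results from the browser, e.g.:
#   MATCH ()-[r:relation1]->() RETURN count(r)
```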