
I wrote a quick Ruby routine to load some very large CSV data. I got frustrated with various out-of-memory issues trying to use LOAD CSV, so I reverted to Ruby. I'm relatively new to Neo4j, so I'm trying Neography to just call a Cypher query I build as a string.

The Cypher uses MERGE to add a relationship between two existing nodes:

cmdstr = "MATCH (a:Provider {npi: xxx}), (b:Provider {npi: yyy}) MERGE (a)-[:REFERS_TO {qty: 1}]->(b)"

@neo.execute_query(cmdstr)

I'm just looping through the rows in a file and running one of these per row. It fails after about 30,000 rows with the socket error "cannot assign requested address". I believe GC is somehow causing issues, but the logs don't tell me anything. I've tried tuning GC differently and trying different amounts of heap; it fails in the same place every time. Any help appreciated.
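For concreteness, a minimal sketch of that loop (the file name and the from_npi/to_npi column names are placeholders, not my actual schema):

require 'csv'
require 'neography'

@neo = Neography::Rest.new("http://localhost:7474")

# One Cypher call per CSV row -- this is the pattern that dies after
# ~30,000 rows with Errno::EADDRNOTAVAIL.
CSV.foreach("referrals.csv", headers: true) do |row|
  cmdstr = "MATCH (a:Provider {npi: #{row['from_npi']}}), " \
           "(b:Provider {npi: #{row['to_npi']}}) " \
           "MERGE (a)-[:REFERS_TO {qty: 1}]->(b)"
  @neo.execute_query(cmdstr)
end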

[edit] More info: running netstat --inet shows thousands of connections to localhost:7474, which presumably explains the "cannot assign requested address" error (ephemeral ports being exhausted). Does execute_query not reuse connections by design, or is this an issue?

I've now tried parameters and the behavior is the same. How would you code this kind of query using batches, and how do you make sure it uses the index on npi?

ftroop

2 Answers


I was finally able to get this to work by changing the MERGE to a CREATE (after deleting all existing relationships first). It still took a long time, but it stayed linear relative to the number of relationships.
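In other words, the change was roughly this (the cleanup query and the xxx/yyy placeholders are illustrative, not my exact code):

require 'neography'

@neo = Neography::Rest.new("http://localhost:7474")

# One-time cleanup: drop any partially loaded relationships
# (illustrative -- this deletes ALL REFERS_TO relationships).
@neo.execute_query("MATCH (:Provider)-[r:REFERS_TO]->(:Provider) DELETE r")

# CREATE skips MERGE's match-or-create lookup, so each row becomes a
# pure insert.
cmdstr = "MATCH (a:Provider {npi: xxx}), (b:Provider {npi: yyy}) " \
         "CREATE (a)-[:REFERS_TO {qty: 1}]->(b)"
@neo.execute_query(cmdstr)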

I also changed garbage collection from Concurrent Mark-Sweep (CMS) to the parallel collector (ParallelGC). The concurrent sweep would just fail and revert to a full GC anyway.

#wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+UseParallelGC
wrapper.java.additional=-XX:+UseNUMA
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional=-Xmn630m

ftroop

With Neo4j 2.1.3 the LOAD CSV issue is resolved:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "http://npi_data.csv" as line
MATCH (a:Provider {npi: line.xxx})
MATCH (b:Provider {npi: line.yyy}) 
MERGE (a)-[:REFERS_TO {qty: line.qty}]->(b);

In your Ruby code you should use Cypher parameters and probably the transactional API (see the sketch below). Do you limit the concurrency of your requests somehow (e.g. a single client)?
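A rough sketch of that approach with Neography's transactional helpers (begin_transaction / in_transaction / commit_transaction); the batch size, file name, and column names are assumptions:

require 'csv'
require 'neography'

@neo = Neography::Rest.new("http://localhost:7474")

# Parameters keep the statement cacheable; SET (instead of a property
# map inside the MERGE pattern) avoids the "can't use parameter ...
# with MERGE" error seen on older 2.x releases.
cypher = "MATCH (a:Provider {npi: {from}}) " \
         "MATCH (b:Provider {npi: {to}}) " \
         "MERGE (a)-[r:REFERS_TO]->(b) " \
         "SET r.qty = {qty}"

CSV.foreach("referrals.csv", headers: true).each_slice(1000) do |batch|
  tx = @neo.begin_transaction
  batch.each do |row|
    @neo.in_transaction(tx, [cypher, { from: row["from_npi"],
                                       to:   row["to_npi"],
                                       qty:  row["qty"].to_i }])
  end
  @neo.commit_transaction(tx)
end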

Also make sure to have an index or constraint created for your providers:

 create index on :Provider(npi);

or

 create constraint on (p:Provider) assert p.npi is unique;
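
To address the batching question: with the index in place, you can also send one parameterized UNWIND statement per batch, so each HTTP request handles many rows. A sketch (UNWIND requires Neo4j 2.1+; batch size and column names are assumptions):

require 'csv'
require 'neography'

@neo = Neography::Rest.new("http://localhost:7474")

# One request per 1000 rows; the MATCH on {npi: row.from} can use the
# :Provider(npi) index, and SET keeps the parameter out of the MERGE
# pattern itself.
cypher = "UNWIND {rows} AS row " \
         "MATCH (a:Provider {npi: row.from}) " \
         "MATCH (b:Provider {npi: row.to}) " \
         "MERGE (a)-[r:REFERS_TO]->(b) " \
         "SET r.qty = row.qty"

CSV.foreach("referrals.csv", headers: true).each_slice(1000) do |batch|
  rows = batch.map do |r|
    { from: r["from_npi"], to: r["to_npi"], qty: r["qty"].to_i }
  end
  @neo.execute_query(cypher, rows: rows)
end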
Michael Hunger
  • Thanks for the answer. I am using 2.1.3. I'm trying to load about 154M relationships with about 4 properties each and I have not found a strategy that seems to work well. Trying parameters in neography throws an error saying you can't use parameter msps with MERGE. – ftroop Aug 18 '14 at 18:36
  • Also, the code above is what I was doing. It bogs down after about 5 minutes with huge GC times. I've burned many hours playing with different settings for GC and memory. It seems it will just take days to get this data loaded. – ftroop Aug 18 '14 at 18:41
  • Also, running neography even with parameters gets socket errors after about 25k rows. /var/lib/gems/1.9.1/gems/excon-0.39.4/lib/excon/socket.rb:187:in `connect_nonblock': Cannot assign requested address - connect(2) (Errno::EADDRNOTAVAIL) (Excon::Errors::SocketError) – ftroop Aug 18 '14 at 19:06