
I'm playing with a Neo4j database consisting of roughly 275,000 English words linked to the letters they contain. I'm running Neo4j 2.0.1 Community Edition on Windows.

I'm trying to use the following Cypher to insert new word nodes into the graph, update properties on those nodes, and then create relationships to the existing letter nodes whenever a word node is newly added:

BEGIN
MATCH (A:Letter {token:"A"}),
(B:Letter {token:"B"}),
(C:Letter {token:"C"}),
(D:Letter {token:"D"}),
(E:Letter {token:"E"}),
(F:Letter {token:"F"}),
(G:Letter {token:"G"}),
(H:Letter {token:"H"}),
(I:Letter {token:"I"}),
(J:Letter {token:"J"}),
(K:Letter {token:"K"}),
(L:Letter {token:"L"}),
(M:Letter {token:"M"}),
(N:Letter {token:"N"}),
(O:Letter {token:"O"}),
(P:Letter {token:"P"}),
(Q:Letter {token:"Q"}),
(R:Letter {token:"R"}),
(S:Letter {token:"S"}),
(T:Letter {token:"T"}),
(U:Letter {token:"U"}),
(V:Letter {token:"V"}),
(W:Letter {token:"W"}),
(X:Letter {token:"X"}),
(Y:Letter {token:"Y"}),
(Z:Letter {token:"Z"})
// Create Words and link to proper letters
MERGE (w1:Word {string:"WHOSE", length:5})
ON MATCH SET w1.s_enable1=TRUE
ON CREATE SET w1.s_enable1=TRUE
// create the letter->word relationships if necessary
CREATE UNIQUE (w1) <-[:IN_WORD {position:1}]- (W)
CREATE UNIQUE (w1) <-[:IN_WORD {position:2}]- (H)
CREATE UNIQUE (w1) <-[:IN_WORD {position:3}]- (O)
CREATE UNIQUE (w1) <-[:IN_WORD {position:4}]- (S)
CREATE UNIQUE (w1) <-[:IN_WORD {position:5}]- (E)
MERGE (w2:Word {string:"WHOSESOEVER", length:11})
ON MATCH SET w2.s_enable1=TRUE
ON CREATE SET w2.s_enable1=TRUE
CREATE UNIQUE (w2) <-[:IN_WORD {position:1}]- (W)
CREATE UNIQUE (w2) <-[:IN_WORD {position:2}]- (H)
CREATE UNIQUE (w2) <-[:IN_WORD {position:3}]- (O)
CREATE UNIQUE (w2) <-[:IN_WORD {position:4}]- (S)
CREATE UNIQUE (w2) <-[:IN_WORD {position:5}]- (E)
CREATE UNIQUE (w2) <-[:IN_WORD {position:6}]- (S)
CREATE UNIQUE (w2) <-[:IN_WORD {position:7}]- (O)
CREATE UNIQUE (w2) <-[:IN_WORD {position:8}]- (E)
CREATE UNIQUE (w2) <-[:IN_WORD {position:9}]- (V)
CREATE UNIQUE (w2) <-[:IN_WORD {position:10}]- (E)
CREATE UNIQUE (w2) <-[:IN_WORD {position:11}]- (R)
... N-2 more of these ...;
COMMIT
... M-1 more transactions ...

I'm using the neo4j-shell to execute Cypher command files like the one above to add new words. Most of the words being MERGEd already exist in the graph; only a small fraction are new.
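For reference, a command file like this can be run in one pass from the operating-system shell; the database path and file name below are placeholders:

rem placeholder paths; adjust to your install
neo4j-shell -path data\graph.db -file add_enable1_words.cql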

This code generally works, except that: (a) it runs very slowly (about 50 seconds per 50-word transaction when N = 50), and (b) when new relationships need to be created (via CREATE UNIQUE), transactions slow to many minutes and occasionally fail with the error "GC overhead limit exceeded".

I also tried this using MERGE in place of the CREATE UNIQUEs. That behaved similarly (very slow) and eventually failed with a Java heap memory error after a number of transactions had run. (It seemed like some kind of memory leak.)

Any insights on what I'm doing wrong and/or better ways to accomplish this task would be greatly appreciated.

More Info

This graph is mainly a hands-on prototype to help me understand Neo4j features and functions in a domain of interest: language structure, word statistics, and queries useful for word games (crossword puzzles, Scrabble, Words with Friends, hangman, ...).

All properties have been indexed (via the neo4j.properties file and CREATE INDEX ON commands).
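For the properties matched on above, the schema indexes would presumably look something like this (the exact set of indexed properties is an assumption, since it isn't shown in the post):

// assumed indexes, based on the properties used in the MATCH/MERGE above
CREATE INDEX ON :Letter(token);
CREATE INDEX ON :Word(string);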

s_enable1 denotes the source of the word list being added. In this case, the "enable1" dictionary (173,122 words). The initial graph was created using the "sowpods" dictionary (267,751 words). The s_ prefix stands for "source." Every time a new dictionary is added to the graph, a new property will be created to indicate which words (existing and new) are associated with each list. (For example, the word AA appears in both the sowpods and enable1 dictionaries, thus the AA word node will have both an s_sowpods and s_enable1 property set to TRUE.)
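Using the property names from the post, the AA node would therefore end up carrying both flags, roughly like this (an illustrative node pattern, not a statement to run):

// illustrative: a word present in both source dictionaries
(:Word {string:"AA", length:2, s_sowpods:true, s_enable1:true})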

MERGE or CREATE UNIQUE seem well suited to continually update the graph as new dictionaries are added.

The sowpods build created about 2.5 million (letter)-[:IN_WORD]->(word) relationships. The enable1 merge might create another 500 K or so. (Many enable1 words are quite long, e.g., 16 - 21 letters.)

The OS is Windows 7, running Java 7u51 x64. (I was originally running the 32-bit JVM, which was about 2x slower.) java -XshowSettings shows an 885.5 MB max heap. Database settings are mostly defaults, I believe. (Which settings are particularly salient?)
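(As the comments below suggest, the first setting to look at is the JVM heap. On a 2.x server install that is usually raised in conf/neo4j-wrapper.conf, roughly as follows; the values here are illustrative, not recommendations from the post.)

# illustrative heap settings (values in MB) in conf/neo4j-wrapper.conf
wrapper.java.initmemory=2048
wrapper.java.maxmemory=2048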

jeffj4a
  • What do you actually want to achieve? Did you create constraints/indexes for your match/merge operations? What does s_enable1 mean? And why do you set it on match too? How many relationships do you have in your database? What are your database settings and what OS are you running on? Please update your post with all this information. – Michael Hunger Apr 08 '14 at 00:29
  • If you run this in the neo4j-shell without any parameters it has to parse each of those big statements too. Poor Cypher parser; try to create smaller statements which only match the letters they need. Also, all your letters will be supernodes in the graph, so you might want to consider looking at 2.1. And you don't need CREATE UNIQUE if you put your relationship-creation statement into the ON CREATE clause. – Michael Hunger Apr 08 '14 at 00:32
  • Thanks for the inputs! Not sure how to parameterize my Cypher scripts using the neo shell. (Will need to read up on that.) I wondered if the SET ON CREATE construct would allow creating the new/missing links conditionally as new word nodes are created. Appears that is only applicable to property creation. Still not sure how to restructure things for accomplishing the kind of MERGE update I need. More hints welcome! – jeffj4a Apr 08 '14 at 01:41
  • I would also change the `s_enable1=true` to source="enable1" and `source="sowpods"` should be easier to handle. – Michael Hunger Apr 08 '14 at 12:30
  • On windows increase the heap size (e.g. to 3 or 4G) as the Neo4j memory mapping happens inside of the heap. What are the target counts for nodes and rels? – Michael Hunger Apr 08 '14 at 12:31
  • This sounds like a cool project, would you mind sharing your database when you're done? Or the original sources and an import-script? – Michael Hunger Apr 08 '14 at 12:32
  • @MichaelHunger Thanks again. Haven't tried to work with your answer yet but have a couple of quick follow-ups re: your comments. (1) When you say increase heap size, do you mean via the -Xmx setting in neo4j-community.vmoptions? (Not windows heap setting via regedit I presume.) I tried upping it to -Xmx2048m and still get GC error. I only have 4 GB RAM on target machine. 3-4GB heap seems too high, (2) I need multiple s_ properties since each Word can come from multiple sources. (Didn't want to use a single source bit field for this.), (3) Probably can share my DB when done. Where? – jeffj4a Apr 09 '14 at 20:24
  • Don't know about windows, usually heap is set in neo4j-wrapper.conf, 2G should kind of work. Ok, I understand your multiple source properties. You can share your db zipped e.g. on dropbox or google drive. – Michael Hunger Apr 10 '14 at 00:54

2 Answers


You don't have to parametrize the first part, but you need an index/constraint for that:

create constraint on (l:Letter) assert l.token is unique;
create constraint on (w:Word) assert w.string is unique;

To parametrize on the shell, you can do:

export word=WHOSE

MATCH (w:Word {string:{word}}) RETURN w;

Unfortunately, Neo4j's split() function does not yet work with an empty separator string; otherwise something like this would have been possible:

WITH split({word},"") as letters

Instead, iterate over the character positions with substring():

MERGE (w:Word {string:{word}, length:length({word})})
   ON CREATE SET w.s_enable1=TRUE
FOREACH (i in range(0,length({word})-1) | 
  MERGE (l:Letter {token:substring({word},i,1)})
  MERGE (l)-[:IN_WORD {position:i}]->(w)
)

Concrete example w/o parameter:

MERGE (w:Word {string:"STACKOVERFLOW", length:length("STACKOVERFLOW")})
   ON CREATE SET w.s_enable1=TRUE
FOREACH (i in range(0,length("STACKOVERFLOW")-1) | 
  MERGE (l:Letter {token:substring("STACKOVERFLOW",i,1)})
  MERGE (l)-[:IN_WORD {position:i}]->(w)
)

You can try it here: http://console.neo4j.org

Michael Hunger
  • This is cool. Very elegant and speedy! However :-( ... doesn't solve my problem of **adding new words to an already-existing WordGraph**. I need the `ON MATCH SET w.s_enable1=TRUE` to set the s_enable1 property on preexisting words **and make no other additions**. It appears the final `MERGE (l)-[:IN_WORD {position:i}]->(w)` statement causes redundant relationships to be added if the word already existed (and has other properties such as s_sowpods=TRUE). Any ideas how to prevent such redundant relationships from being formed? – jeffj4a Apr 10 '14 at 18:41
  • False alarm. The :IN_WORD(position) property was indexed 1 - N in my original database. Your code indexed from 0 - N-1. All fixed by changing last command to `MERGE (l)-[:IN_WORD {position:i+1}]->(w)`. Thanks! – jeffj4a Apr 10 '14 at 19:49

Here are my experiences and tips:

  1. Use BatchInserter whenever possible. It requires the database to be offline and forces you to structure your code in a restrictive way, but if you can work within those constraints you'll be rewarded.
  2. Put your database on a RAM disk (tmpfs). This gave roughly a 7.5x speedup on my system, from ~200 CREATEs/s on an HDD to ~1500 CREATEs/s. For an SSD the speedup may be smaller, but in my experience the HDD-to-SSD improvement is sadly not very significant for Neo4j. I'm in a tight spot because my DB is currently about 4 GiB and I only have 8 GB RAM, but even with potential virtual-memory swapping, tmpfs performance is much better than HDD plus Linux buffers/cache.
  3. Limit Cypher query size; my sweet spot is 10-50 MERGEs per query. Beyond 50 MERGEs it gets slower, and at around 500-1000 MERGEs in a single query Neo4j throws a StackOverflowError.
  4. Use up to number_of_CPUs - 2 threads to run Neo4j transactions, leaving one thread for the main thread and another for Neo4j's own transaction writes. This gives ~95% CPU utilization across all 8 logical cores on my i7-3770K.
  5. Drop your indexes before the run and recreate them after all operations have finished. If you need lookups during the load, maintain your own in-memory index (e.g., a ConcurrentHashMap; an ImmutableMap is better if you can build it up front), but make sure you size your heap accordingly. Even if the heap spills into virtual memory, it is still faster than Neo4j's transactional flushes.
  6. If possible, create all indexed nodes beforehand. That way you don't have to care about uniqueness when you create relationships, and you can use MATCH instead of MERGE (see the sketch after this list). For an in-memory index this also means that even if you can't use ImmutableMap, you can use HashMap instead of ConcurrentHashMap, even with multithreading.
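A minimal Cypher sketch of point 6 applied to this word graph, assuming the :Letter nodes were all created up front and relationships are only created for brand-new word nodes (property names follow the question; everything else is assumed):

// letters are pre-created, so a plain MATCH finds them,
// and the relationship can be CREATEd without a uniqueness check
MATCH (l:Letter {token:"W"}), (w:Word {string:"WHOSE"})
CREATE (l)-[:IN_WORD {position:1}]->(w);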

Similar posts:

  1. Neo4j 2.0 Merge with unique constraints performance bug?
  2. How to improve performance for massive MERGE insert?
  3. Neo4jClient - Merge within ForEach with 1000 very slow (Unique Constraint)
Hendy Irawan