I'm experimenting with a Neo4j database of about 275,000 English words, each linked to the letters it contains. I'm running Neo4j 2.0.1 Community Edition on Windows.
I'm trying to use the following Cypher to insert new word nodes into the graph, update properties on those nodes, and then create new relationships to the existing letter nodes if the word node is newly added:
BEGIN
MATCH (A:Letter {token:"A"}),
(B:Letter {token:"B"}),
(C:Letter {token:"C"}),
(D:Letter {token:"D"}),
(E:Letter {token:"E"}),
(F:Letter {token:"F"}),
(G:Letter {token:"G"}),
(H:Letter {token:"H"}),
(I:Letter {token:"I"}),
(J:Letter {token:"J"}),
(K:Letter {token:"K"}),
(L:Letter {token:"L"}),
(M:Letter {token:"M"}),
(N:Letter {token:"N"}),
(O:Letter {token:"O"}),
(P:Letter {token:"P"}),
(Q:Letter {token:"Q"}),
(R:Letter {token:"R"}),
(S:Letter {token:"S"}),
(T:Letter {token:"T"}),
(U:Letter {token:"U"}),
(V:Letter {token:"V"}),
(W:Letter {token:"W"}),
(X:Letter {token:"X"}),
(Y:Letter {token:"Y"}),
(Z:Letter {token:"Z"})
// Create Words and link to proper letters
MERGE (w1:Word {string:"WHOSE", length:5})
ON MATCH SET w1.s_enable1=TRUE
ON CREATE SET w1.s_enable1=TRUE
// create the letter->word relationships if necessary
CREATE UNIQUE (w1) <-[:IN_WORD {position:1}]- (W)
CREATE UNIQUE (w1) <-[:IN_WORD {position:2}]- (H)
CREATE UNIQUE (w1) <-[:IN_WORD {position:3}]- (O)
CREATE UNIQUE (w1) <-[:IN_WORD {position:4}]- (S)
CREATE UNIQUE (w1) <-[:IN_WORD {position:5}]- (E)
MERGE (w2:Word {string:"WHOSESOEVER", length:11})
ON MATCH SET w2.s_enable1=TRUE
ON CREATE SET w2.s_enable1=TRUE
CREATE UNIQUE (w2) <-[:IN_WORD {position:1}]- (W)
CREATE UNIQUE (w2) <-[:IN_WORD {position:2}]- (H)
CREATE UNIQUE (w2) <-[:IN_WORD {position:3}]- (O)
CREATE UNIQUE (w2) <-[:IN_WORD {position:4}]- (S)
CREATE UNIQUE (w2) <-[:IN_WORD {position:5}]- (E)
CREATE UNIQUE (w2) <-[:IN_WORD {position:6}]- (S)
CREATE UNIQUE (w2) <-[:IN_WORD {position:7}]- (O)
CREATE UNIQUE (w2) <-[:IN_WORD {position:8}]- (E)
CREATE UNIQUE (w2) <-[:IN_WORD {position:9}]- (V)
CREATE UNIQUE (w2) <-[:IN_WORD {position:10}]- (E)
CREATE UNIQUE (w2) <-[:IN_WORD {position:11}]- (R)
... N-2 more of these ...;
COMMIT
... M-1 more transactions ...
I'm using the neo4j-shell to execute Cypher command files like this one to add new words. Most of the words being merged already exist in the graph; only a small fraction are new.
This code generally works, except that: (a) it runs very slowly (about 50 seconds per 50-word transaction, i.e. N = 50), and (b) when new relationships need to be created (using CREATE UNIQUE), transactions slow to many minutes and occasionally fail with the error "GC overhead limit exceeded".
I also tried using MERGEs in place of the CREATE UNIQUEs. That behaved similarly (very slow) and eventually failed with a Java heap memory error after a number of transactions had run. (It seemed like some kind of memory leak.)
Any insights on what I'm doing wrong and/or better ways to accomplish this task would be greatly appreciated.
More Info
This graph is mainly a hands-on prototype to help me understand Neo4j features and functions in a domain of interest: language structure, word statistics, and queries useful for word games (crossword puzzles, Scrabble, Words with Friends, hangman, ...).
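To illustrate the kind of word-game query the graph is meant to support, here is a minimal sketch of a hangman or crossword-style lookup. It assumes the labels, property names, and relationship type used in the load script above; the pattern itself (five-letter words with W in position 1 and S in position 4) is only an example.

```cypher
// Find five-letter words matching the pattern W _ _ S _
MATCH (w:Word {length: 5}),
      (:Letter {token: "W"})-[:IN_WORD {position: 1}]->(w),
      (:Letter {token: "S"})-[:IN_WORD {position: 4}]->(w)
RETURN w.string
ORDER BY w.string;
```

A word such as WHOSE from the script above would satisfy this pattern.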
All properties have been indexed (in the neo4j.properties file and with CREATE INDEX ON commands).
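For reference, the CREATE INDEX ON commands were roughly of the following form, one label/property pair per statement. This is a sketch rather than the exact list: it assumes the labels and property names that appear in the load script above.

```cypher
// Schema indexes on the node properties used for lookups in the load script
CREATE INDEX ON :Letter(token);
CREATE INDEX ON :Word(string);
CREATE INDEX ON :Word(length);
CREATE INDEX ON :Word(s_enable1);
```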
s_enable1 denotes the source of the word list being added, in this case the "enable1" dictionary (173,122 words). The initial graph was created using the "sowpods" dictionary (267,751 words). The s_ prefix stands for "source." Every time a new dictionary is added to the graph, a new property will be created to indicate which words (existing and new) are associated with each list. (For example, the word AA appears in both the sowpods and enable1 dictionaries, so the AA word node will have both its s_sowpods and s_enable1 properties set to TRUE.)
MERGE or CREATE UNIQUE seems well suited to continually updating the graph as new dictionaries are added.
The sowpods build created about 2.5 million (letter)-[:IN_WORD]->(word) relationships. The enable1 merge might create another 500,000 or so. (Many enable1 words are quite long, e.g., 16 to 21 letters.)
The OS is Windows 7, running Java 7.51 x64. (I was originally running the 32-bit version, which was 2x slower.) java -XshowSettings shows a maximum heap of 885.5 M. Database settings are mostly default, I believe. (Which settings are particularly salient?)
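A minimal sketch of how these source properties are meant to be used once both dictionaries are loaded. It assumes the s_sowpods and s_enable1 property names described above; the LIMIT is arbitrary.

```cypher
// Words that appear in both source dictionaries (AA should be one of them)
MATCH (w:Word)
WHERE w.s_sowpods = TRUE AND w.s_enable1 = TRUE
RETURN w.string
LIMIT 10;
```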