What you would be measuring then is the performance of the Lucene index, not a graph-database operation.
There are a number of options:
neo4j-import
Neo4j 2.2.0-M03 comes with neo4j-import, a tool that can import a CSV with a billion nodes into Neo4j quickly and in a scalable way.
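An invocation would look roughly like the line below; the store directory, the CSV file name, and the assumption that the node file carries an :ID header column are mine, not something stated above:

bin/neo4j-import --into billion.db --nodes nodes.csv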
parallel-batch-importer API
This is very new in Neo4j 2.2.
I created a node-only graph with 1,000,000,000 nodes in 5 min 13 s (53 GB db) with the new ParallelBatchImporter, which works out to about 3.2M nodes/second.
Code is here: https://gist.github.com/jexp/0ff850ab2ce41c9ca5e6
batch-inserter
You could use the Neo4j Batch-Inserter-API to create that data without creating the CSV first.
See this example, which you would have to adapt so that it does not read a CSV but generates the data directly from a for loop: http://jexp.de/blog/2014/10/flexible-neo4j-batch-import-with-groovy/
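The linked post uses Groovy; as a rough sketch of the same idea in plain Java against the 2.2 Batch-Inserter-API (the store path, node count and property name are placeholders I picked, not taken from the post):

import java.util.Collections;
import java.util.Map;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.Label;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class GenerateNodes {
    public static void main(String[] args) throws Exception {
        Label person = DynamicLabel.label("Person");
        // open the target store directory for batch insertion (single-threaded, no transactions)
        BatchInserter inserter = BatchInserters.inserter("billion.db");
        try {
            for (long id = 0; id < 1_000_000_000L; id++) {
                // one property and one label per node, matching the Cypher examples below
                Map<String, Object> props = Collections.<String, Object>singletonMap("id", id);
                inserter.createNode(props, person);
            }
        } finally {
            inserter.shutdown(); // flushes and closes the store files; required before opening the db
        }
    }
}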
Cypher
If you want to use Cypher, I'd recommend running something like the statements below in the neo4j-shell, started with a bigger heap:

JAVA_OPTS="-Xmx4G -Xms4G" bin/neo4j-shell -path billion.db
Here are the code and timings I took for 10M and 100M nodes on my MacBook, and for 1B nodes on a Linux server:
create a csv file with 1M lines:

ruby -e 'File.open("million.csv","w") { |f| (1..1000000).each{|i| f.write(i.to_s + "\n") } }'
The experiment ran on a MacBook Pro.
Cypher execution is single-threaded.
Estimated size: (15+42) bytes * node count.
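As a rough sanity check, (15+42) bytes * 1,000,000,000 nodes comes to about 57 GB, which is in the same ballpark as the 63 GB the 1B-node store takes on disk below.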
// on my laptop
// 10M nodes, 1 property, 1 label each in 98228 ms (98s) taking 580 MB on disk
using periodic commit 10000
load csv from "file:million.csv" as row
//with row limit 5000
foreach (x in range(0,9) | create (:Person {id:toInt(row[0])*10+x}));
// on my laptop
// 100M nodes, 1 property, 1 label each in 1684411 ms (28 mins) taking 6 GB on disk
using periodic commit 1000
load csv from "file:million.csv" as row
foreach (x in range(0,99) | create (:Person {id:toInt(row[0])*100+x}));
// on my linux server
// 1B nodes, 1 property, 1 label each in 10588883 ms (176 min) taking 63 GB on disk
using periodic commit 1000
load csv from "file:million.csv" as row
foreach (x in range(0,999) | create (:Person {id:toInt(row[0])*1000+x}));
creating indexes
create index on :Person(id);
schema await
// took about 40 mins and increased the database size to 85 GB
Then I can run:
match (:Person {id:8005300}) return count(*);
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row
2 ms