
I have a big graph model and I need to write the result of the following query into a CSV file.

MATCH (u:USER)-[r:PURCHASED]->(o:ORDER)-[h:HAS]->(i:ITEM) RETURN u.id AS userId, i.product_number AS itemId

When I "Explain" query, this is the result I get : enter image description here

It shows that the estimated number of rows is around 9M. My problems are:

1) It takes a lot of time to get a response. From neo4j-shell it takes 38 minutes! Is this normal? BTW, I have all the schema indexes there and they are all ONLINE.

2) When I use Spring Data Neo4j (SDN) to fetch the result, it throws a java.lang.OutOfMemoryError: GC overhead limit exceeded, and that happens when SDN tries to convert the loaded data to our @QueryResult object.
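
(For context, the mapping is roughly like the sketch below; the class, field and repository names are only illustrative, not our real ones, and User stands in for the domain entity.)

    // Illustrative SDN 4.x sketch – names are examples only
    import java.util.List;
    import org.springframework.data.neo4j.annotation.Query;
    import org.springframework.data.neo4j.annotation.QueryResult;
    import org.springframework.data.neo4j.repository.GraphRepository;

    @QueryResult
    public class UserItemRow {
        public Long userId;
        public Long itemId;
    }

    public interface UserRepository extends GraphRepository<User> {
        // SDN materialises ALL ~9M rows into this list before returning,
        // which is where the "GC overhead limit exceeded" error occurs
        @Query("MATCH (u:USER)-[:PURCHASED]->(:ORDER)-[:HAS]->(i:ITEM) " +
               "RETURN u.id AS userId, i.product_number AS itemId")
        List<UserItemRow> findAllUserItemPairs();
    }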

I have tried to optimize the query in different ways but nothing changed! My impression is that I am doing something wrong. Does anyone have any idea how I can solve this problem? Should I go for batch read/write?

P.S. I am using Neo4j Community Edition version 3.0.1 and this is my system info: [screenshot of system information]

and these are my server configs:

dbms.jvm.additional=-Dunsupported.dbms.udc.source=tarball
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=3G
neostore.relationshipstore.db.mapped_memory=4G
neostore.propertystore.db.mapped_memory=3G
neostore.propertystore.db.strings.mapped_memory=1000M
neostore.propertystore.db.index.keys.mapped_memory=500M
neostore.propertystore.db.index.mapped_memory=500M
Lina

3 Answers


Although Neo4j will stream results to you as it matches them, when you use SDN it has to collect the output into a single @QueryResult object. To avoid OOM problems you'll need to either ensure your application has sufficient heap memory available to load all 9M rows, use the neo4j-shell, or use a purpose-built streaming interface such as https://www.npmjs.com/package/cypher-stream (caveat emptor: I haven't tried this, but it looks like it should do the trick).
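
Another JVM option along the same lines is to skip SDN for this one query and consume records one at a time with the official Neo4j Java Bolt driver; a minimal sketch (1.x API; HOST, PORT and credentials are placeholders):

    import org.neo4j.driver.v1.*;

    try (Driver driver = GraphDatabase.driver("bolt://HOST:PORT", AuthTokens.basic("USER", "PASSWORD"));
         Session session = driver.session()) {

        StatementResult result = session.run(
            "MATCH (u:USER)-[:PURCHASED]->(:ORDER)-[:HAS]->(i:ITEM) " +
            "RETURN u.id AS userId, i.product_number AS itemId");

        // records are consumed one at a time, so the application never has to
        // collect all 9M rows into a single object first
        while (result.hasNext()) {
            Record record = result.next();
            // write record.get("userId") and record.get("itemId") to the CSV here
        }
    }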

Vince

Your config settings are not correct for Neo4j 3.0.1.

You have to set the heap in conf/neo4j-wrapper.conf, e.g. 8G,

and the page cache in conf/neo4j.conf (looking at your store you only need 2G for the page cache).
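
As a rough sketch for 3.0.x (exact values depend on your RAM and store size; in 3.0 the wrapper file takes the heap size in MB without a unit, later versions accept units like 8g and move these settings into neo4j.conf):

    # conf/neo4j-wrapper.conf – heap of the Neo4j JVM
    dbms.memory.heap.initial_size=8000
    dbms.memory.heap.max_size=8000

    # conf/neo4j.conf – page cache for the store files
    dbms.memory.pagecache.size=2g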

Also, as you can see, it will create 8+ million rows.

You might have more luck with this query:

MATCH (u:USER)-[:PURCHASED]->(:ORDER)-[:HAS]->(i:ITEM)
WITH DISTINCT u, i
RETURN u.id AS userId, i.product_number AS itemId

It also doesn't make sense to return 8M rows to the neo4j-shell, to be honest. If you want to measure it, replace the RETURN with WITH and add a RETURN count(*):

MATCH (u:USER)-[r:PURCHASED]->(o:ORDER)-[h:HAS]->(i:ITEM)
WITH DISTINCT u, i
WITH u.id AS userId, i.product_number AS itemId
RETURN count(*)

Another optimization could be to go via item and user and do a hash-join in the middle for a global query like this:

MATCH (u:USER)-[:PURCHASED]->(o:ORDER)-[:HAS]->(i:ITEM)
USING JOIN ON o
WITH DISTINCT u, i
WITH u.id AS userId, i.product_number AS itemId
RETURN count(*)

The other thing that I'd probably do to reduce the number of returned results is to try aggregation.

MATCH (u:USER)-[:PURCHASED]->(o:ORDER)-[:HAS]->(i:ITEM)
WITH DISTINCT u, i
WITH u, collect(DISTINCT i) AS products
WITH u.id AS userId, [i IN products | i.product_number] AS items
RETURN count(*)
Michael Hunger
  • Thanks Michael, you are right, the dbms.memory.pagecache.size was set in the wrong file. I changed it. Regarding the number of returned results, I do not want to reduce them, I need them all. I need to create a CSV of userId,itemId ... cannot use any grouping ... I tried all the queries you suggested (via neo4j-shell ...) and all had the same response time. I think my data is not really normalized :-) but I need all of the duplications. – Lina Nov 17 '16 at 11:41

Thanks to Vince's and Michael's comments I found a solution! After doing some experiments it became clear that the server response time is actually good: 1.5 minutes for 9 million rows! The problem is with SDN, as Vince mentioned: the OOM happens when SDN tries to convert the data to @QueryResult objects. Increasing the heap memory for our application is not a permanent solution, as we will have more rows in the future, so we decided to use the neo4j-jdbc driver for big data queries... and it works like a jet! Here is the code example we used:

// requires java.sql.* and java.io.* imports, plus the neo4j-jdbc driver on the classpath
Class.forName("org.neo4j.jdbc.Driver");
try (Connection con = DriverManager.getConnection("jdbc:neo4j:bolt://HOST:PORT", "USER", "PASSWORD");
     // output file name is just an example
     BufferedWriter writer = new BufferedWriter(new FileWriter("user_items.csv"))) {

    // Querying
    String query = "MATCH (u:USER)-[r:PURCHASED]->(o:ORDER)-[h:HAS]->(i:ITEM) RETURN u.id AS userId, i.product_number AS itemId";
    con.setAutoCommit(false); // important for a large dataset
    Statement st = con.createStatement();
    st.setFetchSize(50);      // important for a large dataset

    try (ResultSet rs = st.executeQuery(query)) {
        while (rs.next()) {
            writer.write(rs.getInt("userId") + "," + rs.getInt("itemId"));
            writer.newLine();
        }
    }

    st.setFetchSize(0);
    st.close();
}

Just make sure you use con.setAutoCommit(false) and st.setFetchSize(50) if you know that you are going to load a large dataset. Thanks everyone!

Lina