
I am trying to benchmark three different graph databases: Titan, OrientDB and Neo4j. I want to measure the execution time for creating the database. As a test case I use this dataset: http://snap.stanford.edu/data/web-flickr.html . Although the data are stored locally and not in memory, I have noticed that a lot of memory is consumed, and unfortunately after a while Eclipse crashes. Why is this happening?

Here are some code snippets. Titan graph creation:

public long createGraphDB(String datasetRoot, TitanGraph titanGraph) {
    long duration;
    long startTime = System.nanoTime();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = titanGraph.addVertex(null);
                srcVertex.setProperty( "nodeId", parts[0] );
                Vertex dstVertex = titanGraph.addVertex(null);
                dstVertex.setProperty( "nodeId", parts[1] );
                Edge edge = titanGraph.addEdge(null, srcVertex, dstVertex, "similar");
                titanGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }
    catch(IOException ioe) {
        ioe.printStackTrace();
    }
    catch( Exception e ) {    
        titanGraph.rollback();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}

OrientDB graph creation:

public long createGraphDB(String datasetRoot, OrientGraph orientGraph) {
    long duration;
    long startTime = System.nanoTime();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;    
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = orientGraph.addVertex(null);
                srcVertex.setProperty( "nodeId", parts[0] );
                Vertex dstVertex = orientGraph.addVertex(null);
                dstVertex.setProperty( "nodeId", parts[1] );
                Edge edge = orientGraph.addEdge(null, srcVertex, dstVertex, "similar");
                orientGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }
    catch(IOException ioe) {
        ioe.printStackTrace();
    }
    catch( Exception e ) {    
        orientGraph.rollback();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}

Neo4j graph creation:

public long createDB(String datasetRoot, GraphDatabaseService neo4jGraph) {
    long duration;
    long startTime = System.nanoTime(); 
    Transaction tx = neo4jGraph.beginTx();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Node srcNode = neo4jGraph.createNode();
                srcNode.setProperty("nodeId", parts[0]);
                Node dstNode = neo4jGraph.createNode();
                dstNode.setProperty("nodeId", parts[1]);
                Relationship relationship = srcNode.createRelationshipTo(dstNode, RelTypes.SIMILAR);
            }
            lineCounter++;
        }
        tx.success();
        reader.close();
    } 
    catch (IOException e) {
        e.printStackTrace();
    }
    finally {
        tx.finish();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}

EDIT: I tried the BatchGraph solution and it seems that it will run forever. It ran the whole night yesterday and never came to an end; I had to stop it. Is there anything wrong with my code?

TitanGraph graph = TitanFactory.open("data/titan");
    BatchGraph<TitanGraph> batchGraph = new BatchGraph<TitanGraph>(graph, VertexIDType.STRING, 1000);
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("data/flickrEdges.txt")));
        String line;
        int lineCounter = 1;
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = batchGraph.getVertex(parts[0]);
                if(srcVertex == null) {
                    srcVertex = batchGraph.addVertex(parts[0]);
                }
                Vertex dstVertex = batchGraph.getVertex(parts[1]);
                if(dstVertex == null) {
                    dstVertex = batchGraph.addVertex(parts[1]);
                }
                Edge edge = batchGraph.addEdge(null, srcVertex, dstVertex, "similar");
                batchGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }
    catch(IOException ioe) {
        ioe.printStackTrace();
    }
salvador
  • Can you share the stacktraces? What JVM memory parameters are you using? To print implicit settings use -XX:+PrintFlagsFinal. – Stefan Armbruster Nov 12 '13 at 08:04
  • I'll reiterate what @StefanArmbruster asked: what are your JVM memory parameters? Also, are you running these loads one at a time? Note that `BatchGraph` will want to use memory to cache vertices, so give it as much as you can afford. You are doing a `batchGraph.commit()` on each line, which is a problem: `BatchGraph` will periodically handle commits for you based on the batch size you pass to the constructor (currently 1000; you could likely go bigger). Make sure to call `shutdown()` on the graph at the end of your load to clean up final transactions. – stephen mallette Nov 13 '13 at 12:43
  • Also with Titan you will want to set `storage.batch-loading` equal to `true`. When you do that Titan will ignore locks and eliminate some reads. btw, you aren't doing something trivial here. Trying to load very large graph datasets into any graph requires strategies specific to that graph for the best load performance. Also, it might be worth including some logging in your code to keep track of how the load is progressing. – stephen mallette Nov 13 '13 at 12:47
  • @StefanArmbruster Those are my parameters -Xms64m -Xmx3200m . – salvador Nov 14 '13 at 11:36
  • @stephenmallette Nice. This worked fine. I wonder if there is anything like OrientGraphNoTx in the Titan project, because that needs hardly any memory at all. – salvador Nov 14 '13 at 11:39
  • The best you can do with Titan is to turn on `storage.batch-loading`. I'm not sure there's much else you can do with Titan to bring down memory requirements during loading. Perhaps you could sacrifice some speed for memory by using a cache that doesn't keep growing. Maybe checkout guava cache: https://code.google.com/p/guava-libraries/wiki/CachesExplained Set the size to something reasonable for your memory constraints and go from there. – stephen mallette Nov 14 '13 at 12:57

3 Answers


This answer just covers the Neo4j part.

You're basically running the full import in a single transaction. A transaction is built up in memory and committed to disk. Depending on the size of the data to be imported, this might be the reason for the OOME. To handle this, I see 3 options:

1) use the Neo4j batch inserter. This is a non-transactional way to build up a Neo4j datastore. Since the other two snippets above don't use transactions, I guess the batch inserter is the best way to produce comparable results.

2) adopt memory parameters of your JVM

3) split up the transaction size. A typical good choice is to bundle 10k - 100k atomic operations into one transaction.
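The third option is just a counter-driven commit loop. Here is a minimal, library-free sketch of that pattern (the `commit()` counter only stands in for `tx.success(); tx.finish();` followed by a fresh `beginTx()` in real Neo4j code, and the line count is made up):

```java
public class BatchedCommitSketch {
    static final int BATCH_SIZE = 10_000; // atomic operations per transaction
    static int commits = 0;

    // Stand-in for closing the current transaction and opening a new one.
    static void commit() { commits++; }

    public static void main(String[] args) {
        int totalLines = 25_000; // pretend the edge file has 25k lines
        int ops = 0;
        for (int line = 0; line < totalLines; line++) {
            // ... create two nodes and a relationship here ...
            ops++;
            if (ops % BATCH_SIZE == 0) {
                commit(); // flush the batch, keeping the in-memory transaction bounded
            }
        }
        commit(); // commit the final partial batch
        System.out.println(commits); // 3 commits for 25k lines at batch size 10k
    }
}
```

The point is that memory use is bounded by `BATCH_SIZE` instead of growing with the whole input file.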

Side note: take a look at https://github.com/jexp/batch-import, which allows you to run an import directly from CSV files without any Java coding.

Stefan Armbruster

As you are trying to compare multiple databases I would recommend generalizing your code to Blueprints. The Flickr dataset looks like the right size for something like the BatchGraph Graph wrapper. With BatchGraph you can tune your commit sizes and focus on the code to manage the loading. In that way, you can have one simple class to load all the different Graphs (you could even extend your test to other Blueprints-enabled Graphs easily).
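As a side note on why `BatchGraph`'s vertex cache matters for this dataset: node ids repeat across edge lines, so loading needs a get-or-create step. A library-free sketch of that pattern, with a plain `HashMap` standing in for the cache (the edge list here is made up):

```java
import java.util.HashMap;
import java.util.Map;

public class GetOrCreateSketch {
    public static void main(String[] args) {
        String[] edgeLines = { "1 2", "1 3", "2 3" }; // hypothetical input

        Map<String, Long> vertexCache = new HashMap<>(); // nodeId -> vertex id
        long nextId = 0;

        for (String line : edgeLines) {
            for (String nodeId : line.split(" ")) {
                // only create a vertex the first time an id is seen
                if (!vertexCache.containsKey(nodeId)) {
                    vertexCache.put(nodeId, nextId++);
                }
            }
            // ... addEdge(vertexCache.get(src), vertexCache.get(dst)) ...
        }
        // 3 distinct vertices instead of 6: calling addVertex(null) twice per
        // line, as in the question's code, duplicates every repeated node id.
        System.out.println(vertexCache.size());
    }
}
```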

@Stefan makes a good point about memory: you likely need to boost the -Xmx setting on the JVM to deal with that data. Each Graph handles memory differently (even though they are persisting to disk), and if you are loading all three at once in the same JVM I'd bet there is contention somewhere.

If you plan to go bigger than the Flickr dataset you referenced, then BatchGraph might not be right. BatchGraph is generally good up to a few hundred million graph elements. When you start talking about graphs larger than that, you might want to forget some of what I said about trying to be non-graph-specific. You will likely want to use the best tool for the job for each graph you want to test. For Neo4j, that means Neo4jBatchGraph (at least that way you are still using Blueprints, if that's important to you); for Titan, that means Faunus or a custom-written parallel batch loader; and for OrientDB, OrientBatchGraph.

stephen mallette
  • The Flickr dataset is just a test case. I want to run this code with much bigger datasets, and I think BatchGraph is not suitable for that job. What should I use in that case? Is there anything like transactions in Titan or OrientDB? Any code examples would be extremely helpful. – salvador Nov 12 '13 at 15:14
  • Updated my answer to reflect larger graphs. I don't know of any examples, unfortunately, but loaders I've built don't look terribly different from your code, other than the use of `BatchGraph`. Obviously a Titan-based parallel batch loader would require you to break up your load into several separate processes, so the code there is a bit different as well. The gpars (http://gpars.codehaus.org/) lib might be of use to you there. – stephen mallette Nov 12 '13 at 15:51

With OrientDB you can optimize this import in 2 ways:

  1. using the custom API extensions, and
  2. avoiding transactions entirely.

So open the graph using OrientGraphNoTx instead of OrientGraph, then try this snippet:

OrientVertex srcVertex = orientGraph.addVertex(null, "nodeId", parts[0] );
OrientVertex dstVertex = orientGraph.addVertex(null, "nodeId", parts[1] );
Edge edge = orientGraph.addEdge(null, srcVertex, dstVertex, "similar");

Without calling .commit().

Lvca
  • hey luca...is OrientBatchGraph no longer recommended? – stephen mallette Nov 12 '13 at 21:24
  • I tried this solution and I think it gets a little bit faster, but the memory issue still exists. – salvador Nov 13 '13 at 07:57
  • OrientBatchGraph creates bulks of transactions, which is OK but requires more memory. Remember to follow the performance guidelines: https://github.com/orientechnologies/orientdb/wiki/Performance-Tuning-Blueprints#massive-insertion – Lvca Nov 13 '13 at 14:34
  • @Lvca this worked great. I have two problems, though. It seems that `OrientVertex srcVertex = orientGraph.getVertex(parts[0]);` doesn't work, and it fails if the vertex has already been added, so I have a lot of duplicates. Moreover, it seems that after the graph creation I can't retrieve the edges. Tell me if you need the code I am using for the OrientDB graph creation. – salvador Nov 14 '13 at 11:44
  • Can you post the code to the Official Support Community Group? http://groups.google.com/group/orient-database – Lvca Nov 16 '13 at 08:20