4

The project I am working on currently uses Neo4j Community. We currently process 1-5M vertices with 5-20M edges, but we aim to handle 10-20M vertices with 50-100M edges. We are discussing switching to an open-source graph database that would let us scale to these proportions. For now, our mind is set on JanusGraph with Cassandra.

We have some questions regarding the capabilities and development of JanusGraph, and we would be glad if someone could answer! (Maybe Misha Brukman or Aaron Ploetz?)

On JanusGraph capabilities:

  • We did some experiments using the ready-to-use JanusGraph Docker image, with queries issued from a Java program. The Java program and the Docker image run on the same machine. With 10k-20k vertices and 50k-100k edges inserted, a query to retrieve all the vertices possessing a given property takes 8 to 10 seconds (mean over 10 identical queries, measured as the time elapsed before and after the command in the Java program). The command itself is really simple:

    g.V().has("secText", "some text").inE().outV();

    Moreover, the Docker image seems to break down when I try to insert more records (extending towards 100k vertices).

    We wonder whether this is due to the limited nature of the Docker image, whether something is wrong with our setup, or whether this is normal. Either way, it seems really, really slow.

  • We set up a 2-node Cassandra cluster (on 2 different VMs) with JanusGraph on top; again, the results were quite slow.

  • From what I read on the Internet, people seem to run JanusGraph deployments with millions of vertices in production, so I guess they can execute simple queries in a matter of milliseconds. What is the secret there? Do you need something like 128GB of RAM for the whole thing to perform correctly? Or maybe there is a guide to good practices that I am unaware of? I tried my best using the official JanusGraph documentation and user comments on forums like this one, but that isn't much, I'm afraid :/
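The measurement described in the first bullet (mean over 10 identical queries) can be sketched as a small harness. This is only an illustration: `QueryTimer`, `meanMillis`, and the stand-in query are hypothetical names, and the warm-up runs are an added assumption, on the reasoning that the first executions in a fresh JVM tend to include class loading and connection setup that would inflate the mean.

```java
import java.util.function.Supplier;

/** Hypothetical timing harness; the query passed in is whatever call you want to measure. */
public class QueryTimer {

    /** Runs the query `warmups` times untimed, then returns the mean latency in ms over `runs` timed calls. */
    public static <T> double meanMillis(Supplier<T> query, int warmups, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmups; i++) {
            query.get(); // result discarded, only here to warm up the JVM and any connections
        }
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            query.get();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / runs;
    }

    public static void main(String[] args) {
        // Stand-in workload; in the setup above this would wrap the traversal and consume its results.
        Supplier<Long> fakeQuery = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
            return sum;
        };
        System.out.printf("mean latency: %.3f ms%n", meanMillis(fakeQuery, 5, 10));
    }
}
```

When wrapping a real traversal, the supplier should also iterate the results (for example by collecting them into a list), otherwise the timing may not include fetching them.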

On JanusGraph's future:

  • JanusGraph seemed to evolve quite quickly over its first years (around 2016-2018), but over the past few months I haven't seen much activity from the JanusGraph community, except for the release of version 0.5 a few months ago. For example, there has been no meeting since last year. So I'm wondering: is JanusGraph on the right track to last and be maintained for many years to come? Did things slow down a bit because of COVID, or is there another reason?
  • Is backward compatibility considered in JanusGraph? From what I can read in the docs, many things have changed from version 0.2/0.3 to 0.4 and 0.5, and more changes are to come, for example Cassandra Thrift and embedded being deprecated. So, in a production environment where we can't always afford to upgrade every year, let alone modify code whenever some component is deprecated, do the JanusGraph developers plan to achieve some backward compatibility soon, or should we wait for version 1.0 for that?

Thank you for reading all this; I am looking forward to any answers you can give me :) Have a nice day!

Mael

MaelC_fr
  • Hi, possibly your query was executed as a full scan. Try checking the log for a message like this: `WARN transaction.StandardJanusGraphTx: Query requires iterating over all vertices [()]. For better performance, use indexes`, or check `profile()` for your query. More info about indexes: https://docs.janusgraph.org/index-management/index-performance/ – mad Aug 27 '20 at 10:22
  • Sharing a blog post that I found useful: https://www.experoinc.com/post/have-you-had-your-janusgraph-tuneup. I would still appreciate more opinions from other experts on the 2 DBs in a production setting! – chaooder Oct 15 '20 at 01:09
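To make the full-scan comment above concrete, here is a toy model in plain Java contrasting a scan over every vertex with an exact-match index lookup. It is not JanusGraph code: `IndexSketch`, `Vertex`, and the methods are hypothetical names, and the `HashMap` only stands in for the idea of an index keyed on the `secText` value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: full scan versus index lookup on one property. */
public class IndexSketch {

    /** Hypothetical stand-in for a vertex carrying a single "secText" property. */
    public static class Vertex {
        public final long id;
        public final String secText;

        public Vertex(long id, String secText) {
            this.id = id;
            this.secText = secText;
        }
    }

    /** Full scan: inspects every vertex, so the cost grows with the total vertex count. */
    public static List<Long> fullScan(List<Vertex> vertices, String value) {
        List<Long> hits = new ArrayList<>();
        for (Vertex v : vertices) {
            if (v.secText.equals(value)) {
                hits.add(v.id);
            }
        }
        return hits;
    }

    /** Builds a map from property value to matching vertex ids (paid once, up front). */
    public static Map<String, List<Long>> buildIndex(List<Vertex> vertices) {
        Map<String, List<Long>> index = new HashMap<>();
        for (Vertex v : vertices) {
            index.computeIfAbsent(v.secText, k -> new ArrayList<>()).add(v.id);
        }
        return index;
    }

    /** Index lookup: a single hash probe, regardless of how many vertices exist. */
    public static List<Long> indexLookup(Map<String, List<Long>> index, String value) {
        return index.getOrDefault(value, Collections.<Long>emptyList());
    }

    public static void main(String[] args) {
        List<Vertex> vertices = new ArrayList<>();
        for (long i = 0; i < 20_000; i++) {
            vertices.add(new Vertex(i, "text-" + (i % 1000)));
        }
        Map<String, List<Long>> index = buildIndex(vertices);
        System.out.println("scan hits:  " + fullScan(vertices, "text-7").size());
        System.out.println("index hits: " + indexLookup(index, "text-7").size());
    }
}
```

Both paths return the same ids; the difference is the work done per query. In JanusGraph the analogous fix would be defining an index on the property, as described in the documentation linked in the comment above.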

2 Answers

1

JanusGraph with Cassandra has design limitations at the storage layer that make performance slow. In practice, it's a large, scalable, but slow graph database that offers the replication and redundancy benefits of Cassandra.

Cassandra shards data and is very good at distributing it randomly across the cluster; however, this destroys the data locality that is needed to make traversals fast and efficient. JanusGraph also supports several storage backends in addition to Cassandra, which means it's not tightly tuned to any particular storage architecture.
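A rough way to see the locality point is a toy model: assign vertices to partitions by hashing their ids and count how many edges keep both endpoints on the same partition. Everything here is an assumption made for illustration: `LocalitySketch` and its methods are hypothetical, the hash is a simple multiplicative mix rather than Cassandra's Murmur3 partitioner, and the edges are uniformly random rather than taken from a real graph.

```java
import java.util.Random;

/** Toy model only: how hash sharding scatters the endpoints of graph edges. */
public class LocalitySketch {

    /** Maps a vertex id to a partition using a simple multiplicative hash (not Murmur3). */
    public static int partitionOf(long vertexId, int partitions) {
        long mixed = vertexId * 0x9E3779B97F4A7C15L;
        return Math.floorMod(Long.hashCode(mixed), partitions);
    }

    /** Fraction of randomly generated edges whose two endpoints share a partition. */
    public static double localEdgeFraction(int vertices, int edges, int partitions, long seed) {
        Random random = new Random(seed);
        int local = 0;
        for (int i = 0; i < edges; i++) {
            long from = random.nextInt(vertices);
            long to = random.nextInt(vertices);
            if (partitionOf(from, partitions) == partitionOf(to, partitions)) {
                local++;
            }
        }
        return (double) local / edges;
    }

    public static void main(String[] args) {
        for (int partitions : new int[] {1, 2, 4, 8}) {
            double fraction = localEdgeFraction(10_000, 100_000, partitions, 42L);
            System.out.printf("%d partitions: %.3f of edges stay local%n", partitions, fraction);
        }
    }
}
```

In this model the local fraction falls to roughly one over the number of partitions, so with more nodes a growing share of traversal hops would have to cross the network.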

Memory can make a difference, so verify how much memory you have allocated to the JVM on each node, use G1GC, and disable swap. VisualVM is helpful for profiling your memory headroom.
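As a quick sanity check of the JVM settings mentioned above, the standard `java.lang.management` API can report the maximum heap and the active collectors. A minimal sketch follows; `JvmCheck` is a hypothetical name, and the assumption that G1's collector beans have names starting with "G1" reflects HotSpot behaviour and may differ on other JVMs. Note that it inspects the JVM it runs in, not a remote JanusGraph or Cassandra process.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Reports heap size and garbage collectors of the JVM running this code. */
public class JvmCheck {

    /** Maximum heap the JVM will attempt to use, in MiB (driven by -Xmx when set). */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    /** Names of the garbage collectors registered in this JVM. */
    public static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    /** Assumption: on HotSpot, G1 collector names start with "G1". */
    public static boolean usesG1(List<String> names) {
        for (String name : names) {
            if (name.startsWith("G1")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> names = collectorNames();
        System.out.println("max heap (MiB): " + maxHeapMiB());
        System.out.println("collectors:     " + names);
        System.out.println("G1 in use:      " + usesG1(names));
    }
}
```

Running it with, for example, `java -Xmx4g -XX:+UseG1GC JvmCheck` shows how those two flags change the reported values.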

Brad Schoening
  • Hi, and thank you for the answer. I chose Cassandra as the backend for its high scalability. Would you advise another backend that could scale to handle large volumes, on the order of 10M vertices, while being more respectful of data locality? Also, maybe there is something to be done with the Cassandra partitioner? I think the one currently used in my deployment is Murmur3, but maybe another one would be better suited to executing graph traversals? – MaelC_fr Aug 26 '20 at 09:24
  • Did Neo4j not scale to 10M vertices? It doesn't sound like that should have been a problem. You can change the partitioner, but they're all designed to shard and don't understand the graph model. I've not worked with the other backends, but it could be worth a try. – Brad Schoening Aug 27 '20 at 02:55
0

Hello, I know this might be late, but please tell me: are you accessing all the vertices for analysis or for transactional queries (OLAP or OLTP)? How many vertices and edges you query, and how you query them, has a major effect. For example, do you tell JanusGraph to return a vertex that has millions of edges together with all those edges in one shot, or only a few of them? This is referred to as the hot vertex problem (a vertex that has so many edges that they can't be stored on one server instance).

Ahmed Nader