
My application is not live yet, so I'm testing the performance of my Gremlin queries before it goes into production.

To test, I'm using a query that adds edges from one vertex to 300 other vertices. It does more than that, but that's the simplified description. I added this workload of 300 just for testing.

If I run the query 300 times, one after the other, it takes almost 3 minutes to finish and 90,000 edges are created (300 x 300).
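
Simplified, the sequential test looks roughly like this (the query string, ids and labels are placeholders for the real traversal):

```javascript
// Simplified sketch of the sequential test; the query below is a placeholder
// for the real traversal that creates ~300 edges per call.
const gremlin = require('gremlin');

const client = new gremlin.driver.Client('ws://localhost:8182/gremlin', {
  traversalSource: 'g',
});

// Connect one vertex to a list of target vertices.
const query = 'g.V(fromId).as("a").V(targetIds).addE("relates").from("a")';

async function sequentialTest() {
  console.time('sequential');
  for (let i = 0; i < 300; i++) {
    // One call at a time, each awaited before the next one starts.
    await client.submit(query, { fromId: i, targetIds: [/* ~300 vertex ids */] });
  }
  console.timeEnd('sequential');
  await client.close();
}

sequentialTest().catch(console.error);
```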

I'm worried because if I have, say, 60,000 users using my application at the same time, they are probably going to be creating 90,000 edges in 2 minutes with this query, and 60,000 concurrent users is not much in my case. If I have 1 million users at the same time, I'm going to need many servers running at full capacity, which is out of my budget.

Then I noticed that while my test is executing, the CPU doesn't show much activity. I don't know why; I don't know how the DB works internally. So I thought a more realistic scenario might be to send all my queries at the same time, because that is what's going to happen with the real users. When I tried to test that, I got a ConcurrentModificationException.
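
Roughly, the concurrent version of the test looked like this (simplified, reusing the client and query from the sketch above):

```javascript
// Same query as before, but all 300 submissions are fired without awaiting
// each one; some of them rejected with ConcurrentModificationException.
async function concurrentTest() {
  const requests = [];
  for (let i = 0; i < 300; i++) {
    requests.push(
      client.submit(query, { fromId: i, targetIds: [/* ~300 vertex ids */] })
    );
  }
  await Promise.all(requests);
}
```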

As far as I understand, this error happens because an edge or vertex is being read or written by 2 queries at the same time. This could happen a lot in my application, because all the user vertices are changing connections to the same 4 vertices all the time, so these "collisions" are going to be happening constantly.

I'm testing locally with Gremlin Server 3.4.8, connecting over websockets from Node.js. My plan is to use AWS Neptune as my database when this goes into production.

What can I do to recover hope? There must be very important things about this subject that I'm missing, because I don't know how graph databases work internally.

Edit

I implemented logic to retry the query requests when receiving an error, using the "Exponential Backoff" approach. It fixed the ConcurrentModificationException, but there are a lot of problems in Gremlin Server when sending multiple queries at the same time, which shows how multi-threading is totally unsupported and unstable in Gremlin Server, and that we should try multi-threading in other Gremlin-compatible databases, as the answer says. I experienced random inconsistencies in the data returned by the queries and errors like NegativeArraySize and other random things coming from the database. Be warned of this so you don't waste time thinking your code might be buggy, like happened to me.
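
The retry wrapper I ended up with looks roughly like this (simplified; the real code has more error handling):

```javascript
// Retry a Gremlin submit with exponential backoff and a little jitter.
async function submitWithRetry(client, query, bindings, maxRetries = 5) {
  let delayMs = 100;
  for (let attempt = 0; ; attempt++) {
    try {
      return await client.submit(query, bindings);
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      // Wait, then try again with double the delay plus some random jitter.
      await new Promise((resolve) =>
        setTimeout(resolve, delayMs + Math.random() * delayMs)
      );
      delayMs *= 2;
    }
  }
}
```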

fermmm

2 Answers


While TinkerPop and Gremlin try to provide a vendor agnostic experience, they really only do so at their interface level. So while you might be able to run the same query in JanusGraph, Neptune, CosmosDB, etc, you will likely find that there are differences in performance depending on the nature of the query and the degree to which the graph in question is capable of optimizing that query.

For your case, consider TinkerGraph, as that is what you are running your tests against locally. TinkerGraph is an in-memory graph without transaction capability and is not proven thread-safe for writes. If you apply a heavy write workload to it, I could easily envision a ConcurrentModificationException being generated. Now consider JanusGraph. If you had tested your heavy write workload against that, you might have found yourself hit with tons of TemporaryLockingException errors (if your schema required a unique property key) and would have had to modify your code to do transactional retries with exponential backoff.

The point here is that if your target graph is Neptune and you have a traversal you've tested for correctness and now are concerned about performance, it's probably time to load test on Neptune to see if any problems arise there.

I'm worried because if I have, say, 60,000 users using my application at the same time, they are probably going to be creating 90,000 edges in 2 minutes with this query, and 60,000 concurrent users is not much in my case. If I have 1 million users at the same time, I'm going to need many servers running at full capacity, which is out of my budget.

You will want to develop a realistic testing plan. Is 60,000 users all pressing "submit" to trigger this query at the exact same time really what's going to happen? Or is it more likely that you have 100,000 users, with some doing reads, and perhaps every half second three of them happen to click "submit"?
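
As a rough illustration only (the cap, request count, and query are placeholders, not recommendations), a load test that limits how many requests are in flight at once might look like:

```javascript
// Load test with a cap on concurrent requests instead of firing everything at once.
async function boundedLoadTest(client, query, totalRequests, maxInFlight) {
  let next = 0;
  async function worker() {
    while (next < totalRequests) {
      const i = next++; // Node.js runs this synchronously between awaits, so no race here
      await client.submit(query, { fromId: i });
    }
  }
  // Start maxInFlight workers that pull request indices from the shared counter.
  await Promise.all(Array.from({ length: maxInFlight }, worker));
}
```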

Your graph growth rate seems fairly high and the expected usage you've described here will put your graph in the category of billions of edges quite quickly (not to mention whatever other writes you might have). Have you tested your read workloads on a graph with billions of edges? Have you tested that explicitly on Neptune? Have you thought about how you will maintain a multi-billion edge graph (e.g. changing the schema when a new feature is needed, ensuring that it is growing correctly, etc.)?

All of these questions are rhetorical and just designed to make you think about your direction. Good luck!

stephen mallette
  • Your answer is very useful. I had a lot of headaches trying to find bugs that I thought I had caused but were actually caused by multi-threading not being supported in the database; I think this should be one of the first things displayed in the documentation. I edited my question to warn visitors about my experience. – fermmm Oct 05 '20 at 20:44
  • Consider using Neo4j or JanusGraph for your tests, as they have good transaction support. OverflowDB might be another option, but I'm not sure if it's thread-safe or not - I've posted the question here to someone who would know: https://twitter.com/spmallette/status/1313272097712599043 – stephen mallette Oct 06 '20 at 00:18
  • I've tried JanusGraph and I have the same problem, it's not thread-safe either; I haven't tried Neo4j yet. – fermmm Oct 08 '20 at 22:12
  • JanusGraph should have no trouble at all when it comes to this sort of thing. Are you saying you are getting a `ConcurrentModificationException` with it? What is the entire stack trace? – stephen mallette Oct 09 '20 at 11:04
  • No, but I get `java.lang.NullPointerException` when reading vertices if the vertices were created by sending multiple queries at the same time. If I create the vertices by sending them one at a time, there is no error. I also have another query, sent at the same time, that creates many edges; when reading those edges in another query I get repeated results or other random inconsistencies. It seems that sending multiple queries at the same time to save data generates data corruption. – fermmm Oct 11 '20 at 03:37
  • JanusGraph should have no trouble with multiple queries at the same time, and a `NullPointerException` seems like a strange sort of problem to arise from a concurrent query situation. You might consider posting a question on the JanusGraph users mailing list with some context about what you are doing. – stephen mallette Oct 11 '20 at 12:08

Although there is already an accepted answer, I want to suggest another approach to dealing with this problem.

The idea is to figure out whether you really need to do synchronous writes to the graph. I'm suggesting that when you receive a request, you just use the attributes in that request to fetch the sub-graph / neighbors, and continue with the business logic.

Simultaneously, put the event in an SQS queue or something similar so that the writes are handled by an asynchronous system - say, AWS Lambda. With SQS + Lambda you can set the write concurrency to whatever your system is comfortable with (low enough that your query does NOT cause the above exception).
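
For illustration only (the queue URL, region, and payload shape are made up), enqueueing the write event from Node.js could look like this, with a Lambda consumer whose concurrency you cap on the AWS side:

```javascript
// Push the write event to SQS instead of writing to the graph synchronously.
const { SQSClient, SendMessageCommand } = require('@aws-sdk/client-sqs');

const sqs = new SQSClient({ region: 'us-east-1' });

async function enqueueEdgeWrites(userId, targetIds) {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/graph-writes',
      MessageBody: JSON.stringify({ userId, targetIds }),
    })
  );
}
```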

Further suggestion: you have an abnormally high write blast radius - your query should NOT touch that many nodes while writing. You can try converting some of the edges into vertices in order to reduce the radius. Then, while inserting a node, you'll only have to create one edge to the vertex that was previously an edge, instead of creating hundreds of edges to all the vertices this node is related to. I hope this makes sense.
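
As a rough sketch of the remodel (labels and ids are made up):

```javascript
// Before: one edge per related vertex - the write touches hundreds of elements.
const fanOutWrite =
  'g.V(userId).as("u").V(targetIds).addE("relatesTo").from("u")';

// After: the shared relationship becomes its own "hub" vertex, so inserting a
// user only touches two elements: the user vertex and one edge to the hub.
const hubWrite = 'g.V(userId).addE("memberOf").to(__.V(hubId))';
```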

Rahul Nimbal