We're trying to upgrade from the old RxJava-based Mongo driver, mongodb-driver-rx (v1.5.0), to the newer mongodb-driver-reactivestreams (v1.13.1) - not the newest release, because of dependencies, but certainly a lot newer. The old RxJava driver has been end-of-life for years. Everything works correctly with the new driver, but under high load performance takes too big a hit, and we can't explain why.
Some background info about our app:
Our (Java) app runs on AWS EC2 (around 30 m5.xlarge instances at peak times) and is based on a Vert.x and RxJava stack. We run a Mongo cluster (m5.12xlarge) with 1 primary and 2 secondaries. The typical number of simultaneous connections to Mongo at peak times is a few thousand. We have a Gatling-based load test in place which typically runs for 1 hour with 60 AWS EC2 instances, 1 Mongo primary and 2 secondaries as in production, and 100k simultaneous users.
A few observations:
- Microbenchmarking a simple piece of integration testing code (which does a few common db operations) indicates no significant performance difference between the old and new driver.
- With the old driver we see good performance overall in the load test: an average response time of 20 ms, and 200 ms at the 99th percentile.
- With the new driver, running the same load test, things explode (over 2000 ms average response time, and eventually over 60% failed requests because the wait queues fill up).
- If we run the load test with only 1 EC2 instance and 1.6k simultaneous users (which is the same load per instance), there is no significant performance difference between the old and new driver, and things run relatively smoothly.
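To make the wait-queue failure mode concrete, here is a driver-independent toy model (plain `java.util.concurrent`; the class names and all numbers are ours for illustration, not the driver's internals). It mimics a pool of `maxSize` connections with a bounded wait queue: once the pool is exhausted, up to `maxWaitQueueSize` checkouts wait up to `maxWaitTimeMS`, and anything beyond that fails immediately, which matches the behavior we see at scale once the queues fill up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of a connection pool with a bounded wait queue: maxSize permits,
// at most maxWaitQueueSize waiters, each waiting at most maxWaitMs.
// Illustrates the configured semantics only; NOT the driver's actual internals.
public class WaitQueueDemo {

    static class ToyPool {
        private final Semaphore connections;
        private final AtomicInteger waiters = new AtomicInteger();
        private final int maxWaitQueueSize;
        private final long maxWaitMs;

        ToyPool(int maxSize, int maxWaitQueueSize, long maxWaitMs) {
            this.connections = new Semaphore(maxSize, true);
            this.maxWaitQueueSize = maxWaitQueueSize;
            this.maxWaitMs = maxWaitMs;
        }

        // Returns true if a "connection" was checked out before the deadline.
        boolean checkout() throws InterruptedException {
            if (waiters.incrementAndGet() > maxWaitQueueSize) {
                waiters.decrementAndGet();
                return false; // wait queue full: fail immediately
            }
            try {
                return connections.tryAcquire(maxWaitMs, TimeUnit.MILLISECONDS);
            } finally {
                waiters.decrementAndGet();
            }
        }

        void checkin() {
            connections.release();
        }
    }

    // Exhaust a tiny pool, then fire 5 concurrent checkouts: typically 3 enter
    // the wait queue and time out while 2 are rejected outright. Either way,
    // all 5 fail because no connection is ever returned to the pool.
    static int simulateFailedCheckouts() throws InterruptedException {
        ToyPool pool = new ToyPool(2, 3, 200);
        pool.checkout();
        pool.checkout(); // both connections now held and never checked in
        AtomicInteger failures = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            Thread t = new Thread(() -> {
                try {
                    if (!pool.checkout()) {
                        failures.incrementAndGet();
                    }
                } catch (InterruptedException ignored) {
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return failures.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("failed checkouts: " + simulateFailedCheckouts());
    }
}
```

In this model, once checkout demand outpaces checkin for long enough, every additional request either waits the full timeout or is rejected on arrival; nothing recovers until connections are released. That is consistent with the step change we see between 1 instance and 60 instances.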
MongoDB driver settings:
clusterSettings = "{hosts=[localhost:27017], mode=MULTIPLE, requiredClusterType=UNKNOWN, requiredReplicaSetName='null', serverSelector='LatencyMinimizingServerSelector{acceptableLatencyDifference=15 ms}', clusterListeners='[]', serverSelectionTimeout='30000 ms', localThreshold='30000 ms', maxWaitQueueSize=500, description='null'}"
connectionPoolSettings = "ConnectionPoolSettings{maxSize=100, minSize=0, maxWaitQueueSize=50000, maxWaitTimeMS=5000, maxConnectionLifeTimeMS=0, maxConnectionIdleTimeMS=300000, maintenanceInitialDelayMS=0, maintenanceFrequencyMS=60000, connectionPoolListeners=[]}"
heartbeatSocketSettings = "SocketSettings{connectTimeoutMS=10000, readTimeoutMS=10000, keepAlive=true, receiveBufferSize=0, sendBufferSize=0}"
readPreference = "primary"
serverSettings = "ServerSettings{heartbeatFrequencyMS=10000, minHeartbeatFrequencyMS=500, serverListeners='[]', serverMonitorListeners='[]'}"
socketSettings = "SocketSettings{connectTimeoutMS=10000, readTimeoutMS=0, keepAlive=true, receiveBufferSize=0, sendBufferSize=0}"
sslSettings = "SslSettings{enabled=false, invalidHostNameAllowed=true, context=null}"
writeConcern = "WriteConcern{w=null, wTimeout=null ms, fsync=null, journal=null}"
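For completeness, this is roughly how the dump above translates into builder calls on the new driver (sketched against the 1.13-era `MongoClientSettings` API; the values are copied from the settings above, and anything not shown is left at its default):

```java
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.ReadPreference;
import com.mongodb.reactivestreams.client.MongoClient;
import com.mongodb.reactivestreams.client.MongoClients;

import java.util.concurrent.TimeUnit;

// Sketch of the settings dump above as builder calls
// (mongodb-driver-reactivestreams 1.13 / driver-core 3.x-era API).
public class MongoConfigSketch {
    public static MongoClient create() {
        MongoClientSettings settings = MongoClientSettings.builder()
            .applyConnectionString(new ConnectionString("mongodb://localhost:27017"))
            .readPreference(ReadPreference.primary())
            .applyToConnectionPoolSettings(b -> b
                .maxSize(100)
                .minSize(0)
                .maxWaitQueueSize(50_000)
                .maxWaitTime(5_000, TimeUnit.MILLISECONDS)
                .maxConnectionIdleTime(300_000, TimeUnit.MILLISECONDS))
            .applyToSocketSettings(b -> b
                .connectTimeout(10_000, TimeUnit.MILLISECONDS)
                .readTimeout(0, TimeUnit.MILLISECONDS))
            .applyToServerSettings(b -> b
                .heartbeatFrequency(10_000, TimeUnit.MILLISECONDS)
                .minHeartbeatFrequency(500, TimeUnit.MILLISECONDS))
            .build();
        return MongoClients.create(settings);
    }
}
```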
Things we've tried (all to no avail):
- Switching the MongoDB version (we are currently still on 3.6, but we've tried 4.0 too);
- Wrapping every db operation in a Vert.x-based RxJava scheduler (we've tried `Schedulers.io()` and `RxHelper.scheduler(vertx)`);
- Configuring the Mongo settings with an `AsynchronousSocketChannelStreamFactoryFactory` containing an `AsynchronousChannelGroup` with a fixed thread pool of size 100;
- Configuring the Mongo settings with a `NettyStreamFactoryFactory` containing a `NioEventLoopGroup`;
- Playing around with the maximum Mongo connection pool size per instance (varying it from 100 to 500).
Things that cannot help us for now (we know about these; some are on our roadmap, but they would be too time-consuming at the moment):
- Better index management (we've already optimized this, there are no queries that use an inefficient collscan)
- Splitting up the app into smaller services
- Easing the load on Mongo by employing in-memory JVM caching (Guava) or remote caching (Redis) - we already do this to some extent
- Getting rid of Vertx in favor of, for instance, Spring Boot
It seems like some kind of pooling or threading issue, but we can't pinpoint the exact problem, and profiling this kind of issue is also very hard.
Any thoughts on what may cause the problem and how to fix it?