I've got a fleet of Java Vert.x servers behind a load balancer that handles spiky traffic. One minute it may be handling 150k r/m, the next 2 million r/m, then right back down to 150k r/m. During these spikes, the entire fleet may become unresponsive for minutes and drop connections, while CPU and memory utilization on any one box barely hits 50%.
To find out what exactly is causing the outage, I set up a single test server that matches the specs of one in my production fleet to see how much I could throw at it before it gave out. My test uses 10 other machines, each of which opens 500 HTTPS connections to the server and sends 1 million requests with a payload of about 2 KB each. That totals 5k concurrent connections sending 10 million requests, for roughly 20 GB of data transferred.
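The totals above can be double-checked with a few lines of Java. This is only a sanity check of the arithmetic; the class and constant names are made up for illustration, and "about 2 KB" is assumed to mean 2,000 bytes.

```java
public class LoadTestTotals {
    // Figures taken from the test description above.
    static final int CLIENT_MACHINES = 10;
    static final int CONNECTIONS_PER_MACHINE = 500;
    static final long REQUESTS_PER_MACHINE = 1_000_000L;
    // Assumption: "about 2 KB" is treated as 2,000 bytes (decimal units).
    static final long PAYLOAD_BYTES = 2_000L;

    public static int totalConnections() {
        return CLIENT_MACHINES * CONNECTIONS_PER_MACHINE;
    }

    public static long totalRequests() {
        return CLIENT_MACHINES * REQUESTS_PER_MACHINE;
    }

    public static long totalBytes() {
        return totalRequests() * PAYLOAD_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("concurrent connections: " + totalConnections());
        System.out.println("total requests:         " + totalRequests());
        System.out.println("total data (GB):        " + totalBytes() / 1_000_000_000L);
    }
}
```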
Once the connections are open I can fire off about 700k requests per minute. I monitor the server's availability simply by making a request to a health endpoint and recording the response time. The response time is fast, tens of milliseconds. I am happy with these results.
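The health check described above could look roughly like the sketch below. It is not the actual monitoring code: the class name, the `/health` path, port 8443, and the 2 second timeout are all assumptions, and it uses the JDK 11+ `java.net.http.HttpClient`. A fresh client is created for every probe on purpose, so each probe has to open a new connection (including the TLS handshake when the URL is HTTPS) instead of reusing a pooled one.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class HealthProbe {

    /** Returns the elapsed time in ms for one GET, or -1 if it failed, timed out, or was not a 200. */
    public static long probeMillis(URI uri, Duration timeout) {
        // New client per probe: forces a new connection instead of a pooled one.
        HttpClient client = HttpClient.newBuilder().connectTimeout(timeout).build();
        HttpRequest request = HttpRequest.newBuilder(uri).timeout(timeout).GET().build();
        long start = System.nanoTime();
        try {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            if (response.statusCode() != 200) {
                return -1;
            }
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and TLS failures.
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical default URL; pass the real health endpoint as the first argument.
        URI health = URI.create(args.length > 0 ? args[0] : "https://localhost:8443/health");
        for (int i = 0; i < 3; i++) {
            long ms = probeMillis(health, Duration.ofSeconds(2));
            System.out.println(System.currentTimeMillis() + " " + ms + " ms");
            Thread.sleep(1000);
        }
    }
}
```

A result of -1 during the connection burst would correspond to the timeouts described in this question. Note that a self-signed certificate would also produce -1 unless the JVM trusts it.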
But before the flood of data starts coming in, these 10 machines must first make 5k connections. During this time, the server is unresponsive and may even time out when I try to check the health endpoint. I believe this is what is causing the outages in my production fleet: the sudden increase in new connections. Once the connections are established, the server has no trouble handling all of the data coming in.
I've updated the nofile ulimit, net.core.netdev_max_backlog, net.ipv4.tcp_max_syn_backlog, and net.core.somaxconn, but it still hangs when receiving a burst of 5k new connection requests within a few seconds of each other.
Is there anything I can do to establish new connections quicker?
Edit:
The actual server runs in a Docker container, and my net settings aren't being applied inside the container. I'm going to fix that next and see if it makes a difference.
Edit Edit:
It's all in the SSL. Making that many connections that quickly over plain HTTP is near instant, so I've got to figure out how to establish TLS connections faster.
Edit Edit Edit:
I found that the native Java security SSL handler was the bottleneck. Switching to netty-tcnative (aka native OpenSSL) pretty much solved my problem with HTTPS.
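For anyone hitting the same thing, the switch in Vert.x amounts to selecting the OpenSSL engine in the server options. The following is a minimal configuration sketch, not the original setup: it assumes Vert.x 3.4 or later with a netty-tcnative artifact (for example netty-tcnative-boringssl-static) on the classpath, and the PEM file names, port, and request handler are placeholders.

```java
import io.vertx.core.Vertx;
import io.vertx.core.http.HttpServerOptions;
import io.vertx.core.net.OpenSSLEngineOptions;
import io.vertx.core.net.PemKeyCertOptions;

public class OpenSslServer {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();

        HttpServerOptions options = new HttpServerOptions()
                .setSsl(true)
                // Placeholder paths: point these at the real key and certificate.
                .setKeyCertOptions(new PemKeyCertOptions()
                        .setKeyPath("server-key.pem")
                        .setCertPath("server-cert.pem"))
                // Use the native OpenSSL engine (via netty-tcnative) instead of the JDK SSL engine.
                .setSslEngineOptions(new OpenSSLEngineOptions());

        vertx.createHttpServer(options)
                .requestHandler(req -> req.response().end("ok"))
                .listen(8443); // Placeholder port.
    }
}
```

If the native library cannot be loaded, server startup is expected to fail rather than silently fall back, so it is worth checking the logs after making the change.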