
I cannot figure out what is causing the bottleneck on this site: response times become very bad once about 400 users are reached. The site runs on Google Compute Engine, using an instance group with network load balancing. We created the project with Sails.js.
I have been doing load testing with Google Container Engine (Kubernetes), running a locust.py script.

The main results for one of the tests are:

RPS: 30
Spawn rate: 5 users/s
Total users: 1000
Avg response time: 27,500 ms (27.5 seconds!!)

The response time is initially great, below one second, but once we reach about 400 users the response time starts to jump massively.

I have tested obvious factors that can influence that response time, results below:

Compute Engine instances (2 × n1-standard-2, 200 GB disk, 7.5 GB RAM per instance):

Only about 20% CPU utilization
Outgoing network: 340 KB/s
Incoming network: 190 KB/s
Disk operations: 1 op/s
Memory: below 10%

MySQL:

Max_used_connections: 41 (below the configured maximum)
Connection errors: 0

All other MySQL metrics also look fine; no reason to suspect a bottleneck there.

I tried the same test against a freshly created Sails.js project. It did better, but still had terrible results: 5-second response times at about 2,000 users.

What else should I test? What could be the bottleneck?

cfl
  • It's likely in your node.js code. Probably something synchronous blocking the event loop that explodes as you ramp up requests per second. Try profiling it. – bbuckley123 Mar 02 '16 at 14:37
  • Thanks. I have found a lot of people saying that globalAgent.maxSockets can be an issue. The current default is 5. Could this be the reason? @bbuckley123 – cfl Mar 03 '16 at 06:24
  • The default is actually Infinity. It's been that way for several node releases: https://nodejs.org/api/http.html#http_agent_maxsockets – bbuckley123 Mar 03 '16 at 19:27
  • Our node version was 'old' and still had the limit. I did update node to a version with Infinity maxSockets, but that didn't seem to change anything. – cfl Mar 04 '16 at 06:14
  • I created a new Sails.js project and ran it through the load testing in the same way; the results were better, but still terrible!! The response time was already several seconds at about 2,000 users. Wondering if Sails.js itself is the issue? – cfl Mar 04 '16 at 11:06
  • Profiling indicates that the bad response time comes from (idle). – cfl Mar 08 '16 at 10:41
  • Do you have tests for your project? In my tests, I can usually simulate a user sending a request, all the way to a response from the server. That usually indicates where there might be a bottleneck. – Bwaxxlo Mar 09 '16 at 11:11
  • I test the main parts of the site with locust.py: requests to the home URL / and the heaviest part of our site, which does the most data processing. Each simulated user waits randomly between 2 and 7 seconds between requests. Both have bad response times. Thanks for your response @Bwaxxlo – cfl Mar 09 '16 at 11:21
  • High latency can be difficult to localize, especially when requests involve many steps such as a load balancer, asynchronous runtimes, file access and DB connections. In addition, using GCE rules out the convenient use of [Cloud Trace](https://cloud.google.com/trace/). Would you be able to describe or demonstrate the flow of a given request and how each step is handled in a distributed way? What types of connections are used? How and when in a request's lifespan are you logging and profiling? The above may help point to specific issues rather than relying on common bottlenecks. – Nicholas Mar 09 '16 at 15:58
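bbuckley123's point about synchronous blocking can be demonstrated directly. A minimal sketch (hypothetical, not taken from the project): a timer scheduled for 0 ms cannot fire until the synchronous work in front of it finishes, so every request queued on the event loop stalls for the same duration.

```javascript
// Demonstrates how synchronous work delays the event loop: a timer
// scheduled for 0 ms cannot fire until the busy loop has finished.
function busyWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {} // synchronous CPU work blocks the loop
}

const start = Date.now();
setTimeout(() => {
  // With a free event loop this fires within a few ms; here it is
  // delayed by roughly the full duration of the synchronous work.
  console.log(`timer fired after ${Date.now() - start} ms`);
}, 0);

busyWait(200); // every request arriving during these 200 ms also stalls
```

Under load, small synchronous sections add up: at 30 RPS, even 50 ms of blocking per request is 1.5 s of work per second, which saturates a single event loop. That would match response times degrading sharply past a user threshold while overall CPU stays low, since a node process uses only one core.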

1 Answer


Are you doing any file reading/writing? Disk I/O is a major obstacle in node.js and will always cause some issues under load. Cache files you read, or remove the need for such code, as much as possible. In my own experience, serving files like images, CSS and JS through my node server started causing trouble as the number of concurrent requests increased. The solution was to serve all of this through a CDN.

Another problem could be the MySQL driver. We had some problems with connections not being closed correctly (not using Sails.js, but I believe it used the same driver at the time I encountered this), which caused problems on the MySQL server, resulting in long delays when fetching data from the database. You should time and track your MySQL queries and make sure they aren't delayed.
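One way to time queries without touching driver internals is a small wrapper. This is only a sketch: `runQuery` here is a stand-in for a callback-style query function such as `connection.query` from the `mysql` driver.

```javascript
// Wraps any callback-style query function and reports slow calls,
// to confirm whether delays really come from the database.
function timedQuery(runQuery, sql, done) {
  const start = Date.now();
  runQuery(sql, (err, rows) => {
    const elapsed = Date.now() - start;
    if (elapsed > 100) {
      console.warn(`slow query (${elapsed} ms): ${sql}`);
    }
    done(err, rows, elapsed);
  });
}
```

If the wrapper shows queries completing quickly while responses are still slow, the time is being lost elsewhere (pool wait, event loop, network) rather than in MySQL itself.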

Lastly, it could be some issue specific to Sails.js on Google Compute Engine. Make sure there aren't any open issues on either project describing the same problem you are experiencing.

Stian
  • Barely any reading/writing in our project, and the brand-new Sails.js project, which does essentially no reads/writes, also 'bombed out'. So even though a CDN helps with optimization, it doesn't explain why our results are so bad. I checked the MySQL instance: no bottlenecks there, barely any reads/writes. As mentioned, we tried the brand-new Sails.js project, which uses no DB queries. The last point you mentioned is possible; I haven't found anything yet, but will keep searching. Thank you for your response – cfl Mar 08 '16 at 10:37
  • @cfl, are you sending POST or GET requests (I've read somewhere about a problem with POST in sails.js)? Could you create a repo to show what exactly you are doing? – amberv Mar 12 '16 at 15:00
  • Yes, that's what we use for our requests, in our routes.js file. What kind of problem was it, do you remember perhaps? Unfortunately a repo is tough because of intellectual-property issues. What exactly would you want in the repo? To give an idea of the flow: we use sails.get("/get_info", message), which goes through routes.js to call the controller method. The controller method then does the DB queries etc. with Waterline and responds back. @amberv – cfl Mar 14 '16 at 12:25
  • @cfl as far as I remember it was some strange idling of POST requests for XX ms compared to GET requests. I'm not sure it's the reason for your problems (and most likely it's not, as it looks like you are doing GET), but you could compare response times after changing POST to GET requests (if you have any POST) while keeping everything else the same. It shouldn't be difficult anyway. Finally, for testing purposes, maybe you could create a new basic Express project and see if you hit similar limits? Then you'd know whether it's caused by Express rather than Sails.js. – amberv Mar 14 '16 at 12:48
  • @amberv Thanks for your help. I understand; we do use some POST requests, but the load testing was only done on GET requests. I will keep this in mind and possibly use it for improvements. Good idea, that is worth testing; I will post here if the results turn out to be caused by Express rather than Sails.js. – cfl Mar 14 '16 at 15:15
  • @cfl, could you find out what the issue was? – amberv Mar 30 '16 at 21:52
  • @amberv Nope. I was even in communication with Google support, but nothing useful came of it. I have been able to work some things out, though. I removed sails.io and ran some tests, which showed the terrible RPS is somehow DB-related. If I load-test a URL that doesn't use the DB, it stays fast up to 2,000 users, but the load test that requires a DB select is terrible: above 10 s response time by 150 users. I recently tried 2nd-gen Google Cloud SQL; no improvements yet. So I'm playing around with the pool, connectionLimit and waitForConnections settings on the Sails.js side. – cfl Mar 31 '16 at 06:44
  • @cfl maybe it would be interesting for you to set up a very basic pure node.js server with a few lines of code and see what kind of limits you hit in that case? And if it's much better than the current 2,000-user limit you have, you can add a native DB driver and do your selects during the request. Then you will know for sure the best responsiveness you'd be able to achieve. It's just an idea, but it doesn't seem complicated to set up, and it could be quite informative. – amberv Mar 31 '16 at 13:33