I have an AWS ALB that load balances requests round-robin to four servers.

Each server uses pm2 to round-robin those requests to six CPUs.

Node.js processes (a React/Next.js app served by Express.js) run on each of those six CPUs. One of the first things they do is log the incoming request. (They are not fronted by a web server like Apache or nginx; traffic goes straight to Express.js.)
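For context, the request logging happens in an Express middleware registered before any routes, roughly like this (a simplified sketch, not our exact code; the real app uses its own logger and mounts the Next.js request handler after the middleware):

const express = require('express');

const app = express();

// Registered before all routes: any request that reaches this Node.js
// process at all should produce a log line here.
app.use((req, res, next) => {
  console.log(`${new Date().toISOString()} ${req.method} ${req.originalUrl}`);
  next();
});

// ... Next.js request handler and API routes are mounted after this ...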

Usually every single request that hits the ALB gets successfully forwarded and logged by the Node.js process. However, sometimes at high-traffic times, some requests just get dropped and never make it to the Node.js process. Our server logs obviously can't record these failures, since the requests never arrive in the first place; we only see the gap by comparing against the ALB request counts.

I'm trying to understand the mechanism that could lead to them being dropped. Could it be that a Node.js internal queue times out? Or could it be a Linux kernel thing? We are seeing indications that during periods of higher traffic some of the CPUs are busy while others are idle, which makes me think of queue length (Kingman's formula, Little's law, etc.). I can think of a few ways to decrease the probability of this happening, from increasing server capacity, to reducing response time, to changing the server-level load-balancing strategy, but I'm mainly trying to understand where the request actually gets stuck and what determines whether and how it drops or disappears, and especially whether I could log it or emit some kind of signal when it happens. A sketch of the kind of instrumentation I have in mind is below.
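For example, something along these lines (a rough sketch, not code we currently run; it assumes Node 12+ for perf_hooks.monitorEventLoopDelay and a plain Express app) would log raw TCP connections as the kernel hands them over, log client-side aborts, and report event-loop delay, so a connection that arrives but never becomes a logged request, or a worker whose event loop stalls, would at least leave a trace:

const express = require('express');
const { monitorEventLoopDelay } = require('perf_hooks');

const app = express();
// ... logging middleware and routes as above ...

const server = app.listen(process.env.PORT || 3000);

// Fires when an accepted TCP connection reaches this worker, before any
// HTTP parsing or Express middleware runs.
server.on('connection', (socket) => {
  console.log(`connection from ${socket.remoteAddress}:${socket.remotePort}`);
});

// Fires if the client (or the ALB) resets or sends a malformed request
// while it is still being parsed in this process.
server.on('clientError', (err, socket) => {
  console.log(`client error: ${err.code}`);
  socket.destroy();
});

// Sample event-loop delay; a large max means this worker was too busy to
// accept or parse new requests for that long.
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();
setInterval(() => {
  console.log(
    `event loop delay p99=${Math.round(loopDelay.percentile(99) / 1e6)}ms ` +
    `max=${Math.round(loopDelay.max / 1e6)}ms`
  );
  loopDelay.reset();
}, 10000);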

Snippets of pm2 config:

module.exports = {
  apps: [
    {
      name: 'community',
      script: 'dist/server.js',
      instances: -1, // pm2: spawn one worker per CPU core, minus one
      exec_mode: 'cluster', // Node cluster mode; pm2's master shares the port and round-robins connections across workers
      autorestart: true,
      watch: false,
      log_date_format: 'YYYY-MM-DD HH:mm Z',
      max_memory_restart: '2G',
// ...
// and env-specific configs, such as
      env_production: {
        NODE_ENV: 'production',
        NODE_OPTIONS: '--max-old-space-size=3584 --max-http-header-size=16380',
        LOG_LEVEL: 'INFO',
        PORT: 3000,
      },
    },
  ],
  deploy: {
// ...
  },
};
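One specific mechanism I wonder about is the kernel's accept queue for the listening socket: its depth is the smaller of the backlog passed to listen() (Node defaults this to 511) and the kernel's net.core.somaxconn, and connections that overflow it are dropped by the kernel without Node ever seeing them. If that's what is happening, the "times the listen queue of a socket overflowed" counter in netstat -s on the servers should climb during the busy periods, which would also be the kind of signal I'm after. Setting the backlog explicitly would then be one knob, roughly like this (a sketch, not our production code; the number is a placeholder, and I'm not sure how this interacts with pm2's cluster mode, where the master process owns the listening socket):

const express = require('express');

const app = express();
// ... middleware and routes ...

// http.Server.listen() accepts a backlog option; the kernel caps the
// effective value at net.core.somaxconn. Connections beyond the accept
// queue are dropped before they ever reach Node.
const server = app.listen(
  {
    port: process.env.PORT || 3000,
    backlog: 1024, // placeholder, not a recommendation
  },
  () => console.log(`listening on port ${server.address().port}`)
);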
tunesmith
  • Can you explain in more detail exactly how "Each server uses pm2 to round-robin those requests to six CPUs"? It would be preferable to just show your configuration for the entire stack, as it's not possible yet to rule out any part of it. – Michael Hampton Sep 09 '21 at 03:01
  • pm2 is a Node.js process manager that acts as a cluster to farm work out to the CPUs. It load-balances these requests in a round-robin fashion. But my question is more general: in a scenario where traffic is sent to a server that has a Node.js process serving traffic, under what circumstances would Node.js never serve that request? I'm seeing more requests at the LB level than at the server level. – tunesmith Sep 10 '21 at 20:38
  • I already know what pm2 is. I am waiting to see your configuration. – Michael Hampton Sep 10 '21 at 20:40
  • ah, thanks for clarifying. I added it to the question. – tunesmith Sep 10 '21 at 20:47

0 Answers