
On an AWS Elastic Beanstalk deployment (single server), the Nginx server that talks to the NodeJS/Express server on the same host occasionally complains about lost connections to the upstream.

2020/03/23 10:52:43 [error] 11443#0: *70 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.31.46.70, server: , request: "GET /health-check HTTP/1.1", upstream: "http://172.17.0.3:33080/health-check", host: "172.31.39.242"
2020/03/23 10:52:48 [error] 11444#0: *580 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.31.21.226, server: , request: "POST /api/app/importNutriwebData HTTP/1.1", upstream: "http://172.17.0.3:33080/api/app/importNutriwebData", host: "******"
2020/03/23 10:52:50 [error] 11443#0: *526 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.31.21.226, server: , request: "GET /health-check HTTP/1.1", upstream: "http://172.17.0.3:33080/health-check", host: "172.31.39.242"

This happens for no apparent reason and for seemingly random URLs, including the /health-check URL, which is nothing more than a very simple response.send("OK");.
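For reference, the handler behind the /health-check URL is essentially just the following (a minimal sketch; the surrounding Express wiring is assumed, not copied from the actual app):

// Hypothetical sketch of the /health-check route
app.get('/health-check', (request, response) => {
  response.send('OK');
});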

The upstream 172.17.0.3 is on the very same machine that runs Nginx. All downstream connections come from CloudFront.

The same setup has worked fine for the past 3-4 years, but these errors began to increase 2-3 days ago. I can't think of anything that has changed, except maybe 10% more requests or so. There may be about 50 long-lived EventStream connections, but never more than 100 concurrent connections. I'm pretty sure the NodeJS server itself is fine.

I've also tried upgrading Amazon Linux, rebooting the servers, and rebuilding the whole Elastic Beanstalk deployment; nothing changed.

I can run an endless curl loop against the upstream URL (http://172.17.0.3:33080/health-check), or even against the CloudFront => Nginx public URL, and am unable to reproduce the problem despite sending thousands of requests over several minutes.

The server has about 1.5 GB of RAM free and the CPU is about 80% idle.

Open file handles seem low to me:

$ for pid in $(pidof nginx) ; do sudo ls /proc/$pid/fd | wc -w ; done
130
169
11

$ for pid in $(pidof node) ; do sudo ls /proc/$pid/fd | wc -w ; done
146

Could it be that Nginx runs out of some sort of resource? Is it a timing problem? What can I do to debug this further?

Any help greatly appreciated.

Udo G
  • It looks like your NodeJS app is sending an RST packet to nginx for some reason. Did you check the app's logs? You could also try `tcpdump` and/or `strace` to capture network traffic and syscalls around the time the error happens, then investigate whether there's something weird going on. This resource may be worth checking: http://theantway.com/2017/11/analyze-connection-reset-error-in-nginx-upstream-with-keep-alive-enabled/ – Juraj Martinka Mar 24 '20 at 07:07
  • Yes, I've checked the app's logs. Nothing unusual there. The server handles 2-3 quick requests per second, so is mostly idle. Will check that link, thanks. – Udo G Mar 24 '20 at 09:19
  • I increased NodeJS keepalive from 5 seconds to 2 minutes as suggested in the comments section of http://theantway.com/2017/11/analyze-connection-reset-error-in-nginx-upstream-with-keep-alive-enabled/#comment-2424 and so far it looks very promising. In fact, that page seems to describe exactly my problem. @JurajMartinka Perhaps you might want to add this as an answer to my question, so I could mark it as the solution? – Udo G Mar 24 '20 at 09:43

1 Answer

It looks like your NodeJS app is sending an RST packet to nginx for some reason. You could try tcpdump and/or strace to capture network traffic and syscalls around the time the error happens, then investigate whether there's something weird going on.

This resource, which seems to describe a very similar issue related to the keepalive timeout, can be useful: http://theantway.com/2017/11/analyze-connection-reset-error-in-nginx-upstream-with-keep-alive-enabled
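Concretely, the fix that worked here (per the comments below) is to raise Node's keep-alive timeout above Nginx's, so Node never closes an idle upstream connection that Nginx still considers reusable. A minimal sketch, assuming an Express app; the port and the headersTimeout line are illustrative rather than taken from the original setup:

const express = require('express');
const app = express();

app.get('/health-check', (request, response) => response.send('OK'));

const server = app.listen(33080);

// Node's default keepAliveTimeout is 5 seconds; raise it well above Nginx's
// keep-alive timeout so Node does not reset sockets Nginx is about to reuse.
server.keepAliveTimeout = 2 * 60 * 1000; // 2 minutes

// On newer Node versions, headersTimeout should stay above keepAliveTimeout,
// otherwise similar resets can still occur.
server.headersTimeout = 2 * 60 * 1000 + 1000;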

Juraj Martinka
  • After setting `server.keepAliveTimeout = 2 * 60 * 1000;` (twice the Nginx timeout) as suggested [in the article](http://theantway.com/2017/11/analyze-connection-reset-error-in-nginx-upstream-with-keep-alive-enabled/#comment-2424) the problem disappeared. – Udo G Mar 24 '20 at 10:16
  • Also related: https://shuheikagawa.com/blog/2019/04/25/keep-alive-timeout/ – jzavisek Nov 06 '20 at 10:47