
I've been working on a large, multi-server Node.js deployment. The tech stack:

Server 1 (Ubuntu 12.04):

  • Node.js API Server (Express app, used for input)
  • Node.js Push Server (100 workers, used to send out results)
  • Redis
  • Beanstalkd

Servers 2-4 (Ubuntu 12.04):

  • Node.js Engine Server (150 workers per server, used for computation)

All Node.js apps are using Nodestalker as their Beanstalkd client.
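
For context, each worker's consumer is roughly this shape (a simplified sketch following nodestalker's onSuccess-callback style; the tube name, processJob(), and connection details are placeholders rather than the real code):

    // Simplified consumer sketch, one Beanstalkd connection per worker.
    // Method names follow nodestalker's onSuccess-callback style; the tube
    // name and processJob() are placeholders, not the real engine code.
    var bs = require('nodestalker');
    var client = bs.Client(); // connection details to the Beanstalkd host omitted

    // Placeholder for the real computation.
    function processJob(job, done) {
        console.log('processing job', job.id);
        done();
    }

    client.watch('computation').onSuccess(function () {
        (function reserveNext() {
            client.reserve().onSuccess(function (job) {
                processJob(job, function () {
                    client.deleteJob(job.id).onSuccess(function () {
                        reserveNext(); // pull the next job over the same connection
                    });
                });
            });
        })();
    });

Each engine worker holds two clients of this kind, which is where the per-worker connection count below comes from.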

Upon starting up all the servers, one or more of the Node.js apps will crash repeatedly with this error (LongJohn output):

Error: read ECONNRESET
    at errnoException (net.js:901:11)
    at onread (net.js:556:19)
---------------------------------------------
    at Readable.on (_stream_readable.js:681:33)
    at BeanstalkClient.command (/opt/app_deployment/engine/node_modules/nodestalker/lib/beanstalk_client.js:248:13)
    at BeanstalkClient.watch (/opt/app_deployment/engine/node_modules/nodestalker/lib/beanstalk_client.js:285:14)
    at consumer (/opt/app_deployment/engine/scrape.js:52:12)
    at listOnTimeout (timers.js:110:15)
---------------------------------------------
    at Array.<anonymous> (/opt/app_deployment/engine/compute.js:215:9)
    at fire (/opt/app_deployment/engine/node_modules/jquery/lib/node-jquery.js:999:)
    at self.fireWith (/opt/app_deployment/engine/node_modules/jquery/lib/node-jquery.js:1109:7)
    at Object.<anonymous> (/opt/app_deployment/engine/node_modules/jquery/lib/node-jquery.js:1236:16)
    at fire (/opt/app_deployment/common/node_modules/jquery/lib/node-jquery.js:999:)
    at self.fireWith (/opt/app_deployment/common/node_modules/jquery/lib/node-jquery.js:1109:7)
    at self.fire (/opt/app_deployment/common/node_modules/jquery/lib/node-jquery.js:116:10)
    at /opt/app_deployment/common/results.js:18:19

The servers that successfully open all connections work flawlessly until manually restarted.

Each Engine server has two open Beanstalkd clients per worker, and each push worker has a Beanstalkd client as well. This results in roughly 1000 open connections to Beanstalkd at any given time.
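
That breaks down as (3 Engine servers × 150 workers × 2 clients) + (100 push workers × 1 client) = 900 + 100 = 1000 connections, all terminating at the single Beanstalkd instance on Server 1.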

After some research, it seemed as though I had hit the default open-file-descriptor limit of 1024. However, no matter what I raised the limit to, the error still happened almost immediately after restarting the processes. A quick lsof showed no connection leaks.
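
Alongside lsof, a check like this inside a worker shows the live descriptor count (a minimal sketch that assumes Linux's /proc/self/fd is available; not part of the production code):

    // Rough fd-count monitor dropped into a worker (Linux only).
    // Logs how many file descriptors this process currently holds.
    var fs = require('fs');

    setInterval(function () {
        fs.readdir('/proc/self/fd', function (err, fds) {
            if (err) return console.error('fd check failed:', err);
            console.log('worker', process.pid, 'open fds:', fds.length);
        });
    }, 5000);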

As root, I have run ulimit -n 4096 for each user that runs the processes, and the new value is accurately reflected by ulimit -n immediately afterwards.

I have also edited the soft and hard nofile limits in /etc/security/limits.conf for all relevant users. It may or may not be a coincidence, but these values are not applied to those users after a server reboot.

My limits.conf:

beanstalkd soft nofile 4096
beanstalkd hard nofile 4096

After a server reboot, su beanstalkd followed by ulimit -n still shows 1024. I have session required pam_limits.so uncommented in /etc/pam.d/common-session and in all other pam.d files.
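
To rule out the shell reporting one value while the processes receive another, the limits a worker actually got can be dumped at startup (again a small sketch, assuming Linux's /proc interface):

    // Print the limits this Node process actually received (Linux only);
    // the "Max open files" row should match the configured nofile value.
    var fs = require('fs');

    try {
        console.log(fs.readFileSync('/proc/self/limits', 'utf8'));
    } catch (err) {
        console.error('could not read /proc/self/limits:', err.message);
    }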

In short, all signs point to hitting a file-descriptor wall, but no matter what the limit is raised to, the errors still occur. Thanks in advance!

  • That's a lot of connections that may be coming in at once - have you tried slowing things down a little? My other thought would be some sort of firewall rate limiting. – Alister Bulman Dec 21 '13 at 00:04
  • Have you tried opening twice as many connections to see if about half of them fail, to be sure it's the limit you're hitting instead of some rate-limit/concurrency problem? – Paul Scheltema Dec 23 '13 at 20:43
  • @PaulScheltema We can't open double the connections because as soon as we hit the ~1024 limit, the servers all start shutting down due to connection failures. – Swoop Dec 27 '13 at 04:57
