i have a pubsub application (mostly chat but some other goodies being pub-ed and sub-ed too) running on node & socket.io.
i'm load testing this app by spinning up several other (much larger) boxes and running a node app i wrote for this purpose, which spawns a ton of processes that each connect using the socket.io-client package.
i found that i can get about 1k concurrent connections to a single 1GB rackspace cloud box. we need to support between 10k and 100k concurrent connections (for specific events, not all the time), so the plan was to put a load balancer in front and spin up more machines before a big event. but after putting an haproxy box in front, i found that with 2 servers and 2k users i'm golden, while with 4 servers even 3k users is a struggle!
i noticed that when my load tests start causing lots of disconnects, the node servers hit very high cpu usage (90%+). i find that odd because with 2 servers and 2k users cpu peaks around 70% and quickly drops back down.
here are some relevant lines from my haproxy config:
mode http
timeout client 86400000
timeout server 86400000
timeout connect 5000
maxconn 100000
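for context, here's roughly how those lines sit in my config — the frontend/backend names and server addresses below are placeholders, not my real ones, and i'm using balance source so a given client sticks to one backend (socket.io's handshake breaks if requests from the same client bounce between backends):

global
    maxconn 100000

defaults
    mode http
    timeout connect 5000
    timeout client  86400000
    timeout server  86400000

frontend www_in
    bind *:80
    default_backend node_pool

backend node_pool
    balance source
    server node1 10.0.0.1:8080 check
    server node2 10.0.0.2:8080 check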
i've also put some kernel net tuning into /etc/sysctl.conf on my haproxy and node boxes:
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 1024 65023
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_max_orphans = 60000
net.ipv4.tcp_synack_retries = 3
net.core.somaxconn = 50000
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_rmem = 8192 87380 8388608
net.ipv4.tcp_wmem = 8192 87380 8388608
and both the haproxy and node boxes have
ulimit -n 99999
in their relevant init scripts (run before starting haproxy or node).
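to rule out the tuning silently not applying, i've been verifying the settings on the live boxes like this (a sketch — paths assume linux; sysctl -p needs root):

# apply /etc/sysctl.conf changes without a reboot
sysctl -p /etc/sysctl.conf

# spot-check one value actually took
sysctl net.core.somaxconn

# confirm the running process inherited the fd limit from the init script
# (ulimit -n only affects the shell it runs in and its children)
cat /proc/$(pidof haproxy)/limits | grep 'open files'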
the haproxy box is consistently at single digit (or less) cpu usage.
what should my next steps be? does anything here stick out as an issue?