
I am working on a one-to-one chat application running in production. It uses StropheJS to connect to an ejabberd server over BOSH (using ejabberd's default connection manager). The main problem we are facing is that sometimes it takes a long time (~30 seconds) for messages to reach the other end, while at other times they arrive almost instantly. Something like this --

User A sends a message
User B receives instantly
---- [some more message exchange that happens instantly] -----
User A sends message
No message received by B
User A sends another message
Still no message received by B
...
...
(20-30 secs later) B receives the two messages together (not as a single message but without any noticeable time interval between them)

Apart from the chat, the other parts of the web application work fine.

I am having a hard time figuring out where the exact bottleneck is. It's running on an Ubuntu 10.04 instance (2 GB memory + 4 GB swap).

One thing I should mention is that a single machine is used for hosting everything -- apache2, mysql, ejabberd, rabbitmq, mongodb, the message queue workers and the Python web app served by apache2 using mod_wsgi. Besides that, Apache also serves a few static files and proxies the BOSH requests to ejabberd. At any time Apache is at its maximum process count (around 40) and using 700-800 MB of memory, so my guess is that it's doing most of the work. It serves an average of 200k requests per day (a figure obtained from the access logs).
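
To turn the "it's doing most of the work" guess into numbers, one option is to sample per-process CPU and memory for each of the co-hosted services over a short window. Here is a minimal sketch, assuming Python and the psutil package are installed on the box; the process-name list is only an example and should be adjusted to match your ps output (ejabberd and rabbitmq both run inside Erlang's beam.smp, so they get lumped together here):

    # sample_services.py -- rough per-service CPU/memory snapshot (assumes psutil)
    import time
    import psutil

    # Example process names for the services mentioned above; adjust as needed.
    SERVICES = ("apache2", "mysqld", "beam", "mongod", "python")

    def service_name(proc):
        try:
            name = proc.name().lower()
        except psutil.Error:
            return None
        for svc in SERVICES:
            if svc in name:
                return svc
        return None

    if __name__ == "__main__":
        procs = [p for p in psutil.process_iter() if service_name(p)]
        for p in procs:
            p.cpu_percent()        # prime the per-process CPU counters
        time.sleep(5)              # measure over a 5-second window
        totals = {}
        for p in procs:
            try:
                svc = service_name(p)
                cpu, rss = totals.get(svc, (0.0, 0))
                totals[svc] = (cpu + p.cpu_percent(), rss + p.memory_info().rss)
            except psutil.Error:
                continue
        for svc, (cpu, rss) in sorted(totals.items()):
            print("%-8s cpu=%6.1f%%  rss=%7.1f MB" % (svc, cpu, rss / (1024.0 * 1024)))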

We have moved static files to a CDN (which improved performance significantly) and have also logged slow queries and optimized them by creating indexes, which again resulted in an overall performance gain, although I plan to repeat that exercise tomorrow.

Is there a systematic approach that can be followed to arrive at the bottleneck?

I am also confused about the following:

  • whether switching to nginx will improve performance
  • whether it's time to move the services to their own servers, since they may be competing for resources on the single machine
  • whether to upgrade the memory on the machine
  • whether to load balance the HTTP server (although I am a bit doubtful about this, since New Relic shows almost negligible time spent in request queuing)
  • what kinds of measurements can be done on the frontend/backend to get an idea (a backend sketch follows this list)
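
For the backend half of the last point, one low-effort measurement is to wrap the WSGI application in a timing middleware so every request's time inside the Python app gets logged; that separates app time from time spent in Apache/mod_wsgi or the BOSH proxy. A minimal sketch follows -- the log path is a placeholder, and note that it only measures until the app returns its response iterable:

    # timing_middleware.py -- log per-request wall-clock time spent in the WSGI app
    import time
    import logging

    logging.basicConfig(filename="/tmp/request_timing.log",  # placeholder path
                        level=logging.INFO,
                        format="%(asctime)s %(message)s")

    class TimingMiddleware(object):
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            start = time.time()
            try:
                return self.app(environ, start_response)
            finally:
                elapsed_ms = (time.time() - start) * 1000.0
                logging.info("%s %s took %.1f ms",
                             environ.get("REQUEST_METHOD"),
                             environ.get("PATH_INFO"),
                             elapsed_ms)

    # In the mod_wsgi script, wrap the existing app, e.g.:
    # application = TimingMiddleware(application)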

PS: It would also be great to have some suggestions on books to read for understanding the basics of server management/architecture/tuning.

1 Answer

I'd say you're putting a lot on a single box. Memory is just one metric: you could be hitting CPU bottlenecks when you get traffic, or you could be I/O-bound. iostat will give you an idea of the disk activity. You'll probably see the issue go away if you move services to their own servers (have your web server separate from the jabber one).
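
If it helps to watch that continuously rather than eyeballing iostat output, here is a crude iostat-style loop -- a sketch, assuming the psutil Python package is installed; iostat/sar from the sysstat package report the same numbers without any code:

    # iowatch.py -- crude iostat-style sampler (assumes psutil); Ctrl-C to stop
    import time
    import psutil

    prev = psutil.disk_io_counters()
    while True:
        time.sleep(5)
        cur = psutil.disk_io_counters()
        cpu = psutil.cpu_percent()   # CPU utilisation since the previous call
        read_mb = (cur.read_bytes - prev.read_bytes) / (1024.0 * 1024)
        write_mb = (cur.write_bytes - prev.write_bytes) / (1024.0 * 1024)
        print("cpu=%5.1f%%  read=%6.2f MB  write=%6.2f MB  (last 5s)" % (cpu, read_mb, write_mb))
        prev = cur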
