
We have three dedicated servers at our company: one runs Nginx and acts as the web server (PHP), another handles MySQL and Memcached, and the third serves static files: CSS, JS and images.

All servers show up as performing great on New Relic, especially the static files server:

  • CPU continuously under 10%
  • Network IO received is very low; transmitted is around 10 Mb/s tops, but the MySQL server has the same specs and routinely peaks at 20 Mb/s, so I doubt this is an issue.
  • Load average under 0.5

The problem is that at peak times the pictures (which can be 100 kB to 200 kB in size) apparently take a long time to load for some users: many, many seconds, sometimes even up to a minute, when usually they would take just a few seconds at worst.

Any idea what we could do? Ideally, if neither CPU, RAM nor bandwidth has reached any kind of limit, this shouldn't happen.

Any key Nginx config parameters we should be looking at (and probably changing)?

Scott Pack
manuelflara
  • [nmon](http://nmon.sourceforge.net/pmwiki.php) is a great tool for diagnosing exactly what's going wrong from the system's POV. Also, how many users do you get at "peak" times? Are we talking about 10-20 downloads at the same time or 1000-2000? – MikeyB Nov 20 '11 at 13:47
  • We have 500 - 700 concurrent users, but that could mean several thousand pictures being requested. Keep in mind that on one page there can be anywhere from 10 to 100 (or even more) user avatars (small ones), and even on "picture pages" (like Facebook picture pages: a big image, then small avatars in the comments) there's more than just the main image. Add to that that some users (who knows how many) open many tabs to load profiles or pictures all at once, and who knows. – manuelflara Nov 23 '11 at 13:33

2 Answers


There are two possibilities I can think of.

  1. Your disk has hit its I/O cap.
  2. You've hit the worker thread limit in nginx. Look at the worker_* configuration parameters from the Core module and worker_connections from the Events module to figure out how to raise it (see the sketch after this list). The default is a single worker process, which is single-threaded, so if you're running on a multi-CPU platform you should definitely raise it. Even on a single-CPU box you will benefit from raising this number on a machine serving static resources: you'll be disk-I/O bound long before anything else, and other workers can be receiving and processing more requests while the first sits waiting to be fed data from the disk.
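
A minimal sketch of the relevant directives (the values here are illustrative, not recommendations; tune them to your core count and traffic):

```
# /etc/nginx/nginx.conf -- illustrative values only
worker_processes  4;            # often set to the number of CPU cores

events {
    worker_connections  1024;   # maximum simultaneous connections per worker
}
```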
Matthew Scharley
  • Thanks for the answer. We had a combined limit of 2048 (the two parameters multiplied), and we increased it to 16384, way more. The CPU still doesn't show much of a bump in our graphs (although about double the RAM is being used: around 10% instead of the previous 5%). Still, the issue seems to persist. – manuelflara Nov 22 '11 at 21:07

We could sit here and guess at where your bottleneck is all day, but some more general advice will help you find it on your own much sooner.

jeffatrackaid wrote this answer yesterday, which is a more succinct version of what I wrote quite a while ago. I'd suggest reading those first to help you understand how performance debugging is done.

In your case, I would use Firebug first to determine which part of the request is slow during the peak times. This should rule bandwidth in or out as the true problem. Look in the "Net" panel of Firebug and see which part of the request changes between the fast times and the slow times.

Following that, I would run strace with both the -t and -T options on one of the nginx workers during one of these slow times. Analysing that output should show you exactly where nginx is going slow. It is useful to write the strace output to a file and then use less or grep on the file to identify system calls that took a long time.
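
A sketch of what that might look like (the PID and file paths here are illustrative, not from your system):

```
# find an nginx worker PID, then attach strace to it during a slow period
ps -C nginx -o pid,cmd | grep 'worker process'
strace -t -T -p 12345 -o /tmp/nginx-worker.trace   # 12345 = worker PID

# -T appends the time spent in each syscall in <...> at the end of each line;
# sort on that field to surface the slowest calls
sort -t'<' -k2 -rn /tmp/nginx-worker.trace | head -20
```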

You may get some use out of the -c option to strace.
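
For instance (again, the PID is illustrative):

```
# -c suppresses the per-call output and prints a summary table instead:
# time spent, call count and error count per system call
strace -c -p 12345
# let it run for 30 seconds or so during a slow period, then press Ctrl-C
```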

Once you have identified the slow system calls, it may still take some work to figure out which nginx parameter needs changing, but you should be well on your way. Please do come back and ask more specific questions if you need help with that part.

If it turns out to be a file-based system call, be sure to look backwards through the trace until you find the file that it was waiting for. That will be a big hint.

Ladadadada
  • Thanks for the answer. We've tried running strace at peak times (when our issue seems to be happening; it doesn't happen to everyone) and we get A LOT of messages like this: recvfrom(86, 0x7026d0, 1024, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable). Any idea? We also get lots of epoll_wait calls, but I think that's normal? – manuelflara Nov 22 '11 at 21:08