My website loads more slowly than I think it should, due to a few of the assets taking an absurdly long time to download from the server. I've been trying to track down the cause of this. I'm about 95% sure it is a networking issue, not an Apache issue, due to the tests I've done (see below).
Here's a screenshot from Firefox's network inspector. Note that the stuck assets are usually some of these images, but it has also happened with other assets, such as JavaScript files.
Hypothesis and Question
My current theory is that our colo's bandwidth limit is causing packet loss when the browser downloads resources in parallel, perhaps momentarily exceeding the bandwidth limit. Is this a sensible theory? Is there anything we can change apart from requesting more bandwidth, given that we don't use most of the bandwidth most of the time?
Or, is there some other avenue I need to be researching?
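As a rough sanity check on the burst theory, here's some back-of-the-envelope arithmetic. The numbers are assumptions I haven't measured: 6 parallel connections, an initial congestion window of 10 segments, and a 1460-byte MSS.

```python
# Back-of-the-envelope: how big a burst do N parallel connections create,
# and how long does the 4 Mbps uplink take to drain it?
# Assumed (not measured): 6 connections, initial cwnd of 10 segments, 1460-byte MSS.
CONNECTIONS = 6
INIT_CWND_SEGMENTS = 10
MSS_BYTES = 1460
UPLINK_BPS = 4e6   # colo uplink, bits/s
LAN_BPS = 1e9      # server's gigabit link, bits/s

burst_bytes = CONNECTIONS * INIT_CWND_SEGMENTS * MSS_BYTES
burst_bits = burst_bytes * 8

# The server can emit the whole burst in under a millisecond...
emit_seconds = burst_bits / LAN_BPS
# ...but the uplink needs ~175 ms to drain it, so roughly 85 KB has to sit
# in a buffer somewhere along the path, or get dropped.
drain_seconds = burst_bits / UPLINK_BPS

print(f"burst: {burst_bytes} bytes")
print(f"emitted in {emit_seconds * 1000:.2f} ms at gigabit")
print(f"drained in {drain_seconds * 1000:.0f} ms at 4 Mbps")
```

If those assumptions are anywhere close, even the very first round trip of a parallel page load arrives at the uplink far faster than it can be forwarded, which would explain loss in bursts despite low average utilization.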
Configuration
- Apache 2.4.3 on Fedora 18, with plenty of spare CPU and memory.
- Gigabit Ethernet to a switch to a 4 or 5 Mbps uplink via the colocation facility.
- It isn't a very high traffic site. Rarely more than a couple visitors at once.
Tests I've Done
- `traceroute` to the server is fine. `traceroute` from the server to, say, our office stops after 8 or so hops. I'm hypothesizing that this is due to `traceroute` traffic getting blocked somewhere (since things like `wget` (see below) and `ssh` seem to largely work fine), but I can provide more details if this is pertinent.
- `strace` on Apache indicated that the server was serving up the entire image immediately, without delay.
- `tcpdump`/`wireshark` showed that the image data was sent immediately, but some packets were later retransmitted. One trace in particular showed that the final packet of the asset was transmitted immediately by the server and retransmitted several times, but the original packet was the one the browser finally received.
- While I could sometimes reproduce the problem downloading the page via `wget`, it didn't happen as regularly as it did in the browser. My hypothesis is that this is because `wget` doesn't parallelize downloads.
- Testing with `iperf` was interesting. Using `iperf`'s UDP mode, I found that I had next to no packet loss at speeds up to about 4 Mbps. Above that, I began seeing ~10% packet loss. Similarly, in TCP mode, small numbers of parallel connections split the bandwidth sensibly between them, but with 6 or more parallel connections I started seeing a "sawtooth" bandwidth pattern, where a connection would sometimes have bandwidth and sometimes not.
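Since `wget` fetches sequentially, here's the kind of script I could use to mimic the browser's parallel downloads more reproducibly. This is just a sketch; the asset URLs are placeholders, not our real ones, and 6 workers matches the connection count where `iperf` started misbehaving.

```python
# Sketch: fetch several assets in parallel (like a browser) and time each one,
# to see which downloads stall. URLs below are hypothetical placeholders.
import concurrent.futures
import time
import urllib.request

ASSETS = [
    "http://example.com/img1.jpg",  # placeholder
    "http://example.com/img2.jpg",  # placeholder
    "http://example.com/app.js",    # placeholder
]

def timed_fetch(url, timeout=30):
    """Download one URL; return (url, size in bytes, elapsed seconds)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    return url, len(data), time.monotonic() - start

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=6) as pool:
        for url, size, secs in pool.map(timed_fetch, ASSETS):
            print(f"{secs:7.2f}s  {size:8d} bytes  {url}")
```

Running this in a loop and watching for outlier times should make the intermittent stalls easier to catch than clicking reload in a browser.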
I'd be happy to provide more details on any of these, but I didn't want to crowd this post with irrelevant details. I'm hardly knowledgeable enough in networking to know what information is useful and what isn't. :-D Any pointers to other good network-troubleshooting tools would be swell.
EDIT 1: Clarified my near-certainty that Apache isn't to blame, but rather networking something-or-other.
EDIT 2: I tried `iperf` between this server and another of ours on the same gigabit switch, and got a pretty consistent 940+ Mbit/s. I think that rules out most hardware problems or duplex mismatches on our end, except perhaps the uplink.
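For what it's worth, 940+ Mbit/s is essentially line rate for a single TCP stream over gigabit Ethernet with a 1500-byte MTU, which is why I read this as the LAN side being clean. A quick calculation, assuming standard Ethernet framing and the TCP timestamp option enabled:

```python
# Theoretical max TCP goodput on gigabit Ethernet with a 1500-byte MTU.
# Per-frame wire overhead: 12 (interframe gap) + 8 (preamble)
# + 14 (Ethernet header) + 4 (FCS) = 38 bytes.
# Inside the frame: 20 (IP) + 20 (TCP) + 12 (TCP timestamp option) = 52 bytes.
MTU = 1500
WIRE_OVERHEAD = 12 + 8 + 14 + 4
HEADERS = 20 + 20 + 12

payload = MTU - HEADERS               # 1448 bytes of application data
frame_on_wire = MTU + WIRE_OVERHEAD   # 1538 bytes occupying the link

goodput_mbps = 1000 * payload / frame_on_wire
print(f"max goodput: {goodput_mbps:.1f} Mbit/s")  # prints "max goodput: 941.5 Mbit/s"
```

So an observed 940+ Mbit/s leaves almost no room for loss or duplex trouble on the switch itself.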
EDIT 3: While the specifics are very different, this post about a TCP incast problem sounds very similar, in terms of having high-bandwidth traffic shuffled down a narrow pipe in small bursts and losing packets. I need to read it in more detail to see if any of the specifics apply to our situation.