
We have a web server that makes a number of HTTP (actually HTTPS) requests to itself using curl. (This is because we have multiple websites running on the same server, and they use each other's APIs.)

We've found that when there is a large amount of other network traffic to the server (in particular while we're downloading a large file, but also during spikes of network traffic due to database replication or other causes), these requests are very occasionally dropped: the request fails instantaneously (i.e. in 0 seconds) with no response returned. (No HTTP code, nothing.)
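To pin down what "fails instantly" means at the socket level, something like the following could be run alongside the real curl calls. It is only a sketch: `classify_failure` and `probe` are hypothetical helper names, and it uses Python's standard library rather than curl itself. The idea is that a connection that is actively refused or reset normally fails immediately, whereas a silently dropped packet would usually show up as a delay or timeout, so logging the error class and elapsed time should narrow things down.

```python
import errno
import socket
import ssl
import time


def classify_failure(exc):
    """Map a connection-time exception to a short label for logging."""
    # Specific subclasses are checked before the generic OSError fallback.
    if isinstance(exc, socket.timeout):
        return "timeout"
    if isinstance(exc, ConnectionRefusedError):
        return "refused"
    if isinstance(exc, ConnectionResetError):
        return "reset"
    if isinstance(exc, ssl.SSLError):
        return "tls_error"
    if isinstance(exc, socket.gaierror):
        return "dns_error"
    if isinstance(exc, OSError):
        return "os_error:" + errno.errorcode.get(exc.errno, "unknown")
    return "other"


def probe(host, port=443, timeout=5.0, use_tls=True):
    """Open one TCP (optionally TLS) connection; return (label, seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if use_tls:
                ctx = ssl.create_default_context()
                with ctx.wrap_socket(sock, server_hostname=host):
                    pass  # handshake completed; nothing else to do
        label = "ok"
    except Exception as exc:
        label = classify_failure(exc)
    return label, time.monotonic() - start
```

Calling `probe("example.com")` in a loop during a large download (with the real domain substituted for the placeholder) and logging any result other than `"ok"` would show whether the instant failures are refusals, resets, or something else.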

While these failures are correlated with spikes in network traffic, the traffic level at these times still only appears to be around 60Mb/s; this is much higher than our baseline level, but should be well below what the network interface can handle. And since the requests that are failing are from and to the same server, it doesn't seem like any other network devices should be involved.

The failed requests do not appear in the Apache domlogs or error log as having been received, so they are being dropped before ever reaching Apache.

We have also tested with the sending and receiving ends on separate servers, and have confirmed that the problem is on the receiving end, associated with high network bandwidth use on the receiving server. Therefore we suspect that it is not only these internal requests that are failing. It's possible, and I think likely, that the server is intermittently dropping other incoming requests as well, but since we're not making those requests, we have no way of knowing.

Ideally I'm looking for a possible explanation for why these requests would fail instantly. Is it normal under high network load for some requests to be dropped instantly as opposed to just being delayed? And given that we are making curl requests to another domain on the same server, would any traffic need to leave the server in this situation? (i.e. is it possible the bottleneck is elsewhere?) DNS should be cached, so basically the server will be making an HTTP request to its own external IP address. I assume the NIC would be involved, but my understanding of networking is pretty rudimentary. Is it some kind of TCP/IP limit? It's not a connection limit, as it's triggered by a single high-bandwidth transfer, not a large number of separate connections. More generally, what are the possible limits it could be hitting, given that the failure occurs before Apache?
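One thing that could be checked in the meantime is whether the kernel's own TCP drop counters move when a failure occurs. The sketch below parses the text of `/proc/net/netstat`, which Linux lays out as pairs of lines: a header line of counter names followed by a line of values, both with the same prefix. `parse_proc_net`, `tcpext_delta`, and the `WATCHED` list are hypothetical names, and the choice of counters is only a guess about where to look, not a diagnosis.

```python
WATCHED = ("ListenOverflows", "ListenDrops", "TCPBacklogDrop")


def parse_proc_net(text):
    """Parse /proc/net/netstat-style text into {prefix: {counter: int}}."""
    stats = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: "Prefix: name1 name2 ..." then "Prefix: v1 v2 ...".
    for header, values in zip(lines[0::2], lines[1::2]):
        h_prefix, h_names = header.split(":", 1)
        v_prefix, v_vals = values.split(":", 1)
        if h_prefix != v_prefix:
            continue  # skip anything that is not a matching pair
        stats[h_prefix] = {
            name: int(val)
            for name, val in zip(h_names.split(), v_vals.split())
        }
    return stats


def tcpext_delta(before, after, names=WATCHED):
    """Return how much each watched TcpExt counter grew between snapshots."""
    b = before.get("TcpExt", {})
    a = after.get("TcpExt", {})
    return {n: a.get(n, 0) - b.get(n, 0) for n in names}


def snapshot(path="/proc/net/netstat"):
    """Read and parse the current counters (Linux only)."""
    with open(path) as fh:
        return parse_proc_net(fh.read())
```

Taking a `snapshot()` before and after a large download and passing both to `tcpext_delta` would show whether any of these counters increased during the window in which requests failed. A nonzero delta would point at the listen queue or socket backlog; all zeros would suggest looking elsewhere.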

The server is running CentOS 7.6 and Apache 2.4.

Nathan Stretch
  • Hi, we would need to know the switch layout and model to get an idea if a bottleneck can impact you – yagmoth555 Aug 19 '19 at 17:35
  • The error report from IE doesn't say much about what happened at the network layer. Try getting a packet trace of the issue occurring, that should shed some light – Matt Zimmerman Aug 19 '19 at 19:05
  • Are you hitting the outer IP of the server when using curl or do you have a Cloudflare like reverse proxy you're hitting instead? – Ginnungagap Aug 19 '19 at 21:41
  • @yagmoth555 the NIC is a 'Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)' - any other info you need, please just let me know how to query for it. – Nathan Stretch Aug 19 '19 at 21:46
  • @MattZimmerman not sure what error report from IE you're referring to. A packet trace would be great, but is difficult considering how tough this is to replicate. Will continue trying though. – Nathan Stretch Aug 19 '19 at 21:47
  • @Ginnungagap no reverse proxy; we're making curl calls to a domain whose DNS points at the server's external IP. – Nathan Stretch Aug 19 '19 at 21:48
