
We are serving Django via nginx/uwsgi. We use AWS ELB behind a VPC.

This is a web service for mobile clients.

We had reports of timeouts during development, so we added a request_id to every request:

  1. client generates request id
  2. sends GET /request?_request_id=ABDFEFE
  3. if an error occurs, the error is reported to the server along with the request id
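To make the scheme concrete, here is a minimal client-side sketch of the three steps above. The base URL, the function names, and the shape of the error report are assumptions for illustration; only the `_request_id` query parameter comes from the description.

```python
import uuid
from urllib.parse import urlencode


def new_request_id() -> str:
    """Return a random hex request ID generated on the client."""
    return uuid.uuid4().hex.upper()


def build_url(base_url: str, request_id: str) -> str:
    """Append the request ID as the _request_id query parameter."""
    return f"{base_url}?{urlencode({'_request_id': request_id})}"


def build_error_report(request_id: str, error: Exception) -> dict:
    """Build the payload a client would send to a (hypothetical) error-report endpoint."""
    return {
        "request_id": request_id,
        "error_type": type(error).__name__,
        "message": str(error),
    }


if __name__ == "__main__":
    rid = new_request_id()
    # api.example.com is a placeholder host, not the real service.
    print(build_url("https://api.example.com/request", rid))
    print(build_error_report(rid, TimeoutError("read timed out")))
```

Because the ID is generated before the request leaves the device, a report can exist for a request that never reached the server at all.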

The problem I am having now is that I am getting error reports of timeouts that have occurred in the wild. However, the associated request_ids do not show up in any nginx or uwsgi log.
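The cross-check I am doing against the logs looks roughly like the sketch below. It assumes the access log records the full request line including the query string (as nginx's default `combined` format does); the function names and the ID character set are my own assumptions.

```python
import re
from typing import Iterable, Set

# Matches the _request_id query parameter inside a logged request line.
REQUEST_ID_RE = re.compile(r"[?&]_request_id=([A-Za-z0-9-]+)")


def seen_request_ids(log_lines: Iterable[str]) -> Set[str]:
    """Collect every request ID that appears in the given log lines."""
    seen: Set[str] = set()
    for line in log_lines:
        seen.update(REQUEST_ID_RE.findall(line))
    return seen


def missing_request_ids(reported: Iterable[str], log_lines: Iterable[str]) -> Set[str]:
    """Return the reported request IDs that never show up in the logs."""
    return set(reported) - seen_request_ids(log_lines)
```

Any ID returned by `missing_request_ids` belongs to a request that, as far as nginx is concerned, never arrived.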

I am a little suspicious of the Amazon Elastic Load Balancer, but I cannot be sure. Since it is a timeout, we have no ELB headers, status code, response body, or anything else to look at.

We use New Relic to monitor our backend. Occasionally it logs a 'slow transaction' of 3-4 seconds (nothing like the 30-second timeouts that are typical of most client libraries).

The actual question: where should I look next? According to the data I have on the server, nothing is actually wrong, yet the timeouts persist. At this point I don't even know how to begin debugging this. The app servers are running at about 10% capacity (with respect to memory and CPU). Slow SQL queries are being logged, with nothing interesting there either.

(I am also looking into client-side errors separately.)

Thanks in advance.

semarjt

1 Answer


Given that the clients are mobile devices, it's possible the issue is neither your servers nor the ELB.

AWS ELBs don't just drop traffic on the floor, or at least not without letting you know. If the issue is in the ELB, you should see it reflected in the ELB metrics; the ELB 5XX count (HTTPCode_ELB_5XX) and spillover (SpilloverCount) are the metrics I would check first in a case like this one. An ELB problem is unlikely, though, given that your monitoring indicates the service is online.
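As a rough sketch of how to pull those metrics, the snippet below queries CloudWatch with boto3 for a Classic ELB, where the 5XX and spillover counts are published as `HTTPCode_ELB_5XX` and `SpilloverCount` in the `AWS/ELB` namespace. The load balancer name, the 24-hour window, and the helper names are assumptions; it also assumes boto3 is installed and AWS credentials are configured.

```python
from datetime import datetime, timedelta, timezone


def elb_metric_query(load_balancer_name: str, metric_name: str,
                     hours: int = 24, period_seconds: int = 300) -> dict:
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ELB",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "LoadBalancerName", "Value": load_balancer_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Sum"],
    }


def fetch_nonzero_datapoints(load_balancer_name: str, metric_name: str) -> list:
    """Return the datapoints with a non-zero Sum, oldest first."""
    import boto3  # third-party; imported lazily so the query builder works without it

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **elb_metric_query(load_balancer_name, metric_name)
    )
    points = [p for p in response["Datapoints"] if p["Sum"] > 0]
    return sorted(points, key=lambda p: p["Timestamp"])


if __name__ == "__main__":
    # "my-elb" is a placeholder load balancer name.
    for metric in ("HTTPCode_ELB_5XX", "SpilloverCount"):
        print(metric, fetch_nonzero_datapoints("my-elb", metric))
```

If both lists come back empty for the periods in which clients reported timeouts, the ELB is probably not where the requests are being lost.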

Nathan V