An update to the below:
I have experienced a similar problem on an unrelated script, on a Debian virtual machine in a different datacentre.
This looks suspiciously like the issue described here (and like the person asking that question, I do not have a proxy configured in front of the server).
The main difference from the description below is that when I attach to the hung process, I see a call to recvfrom rather than read:
$ strace -p 17527
Process 17527 attached - interrupt to quit
recvfrom(3,
However, Python is not under the impression that it is being proxied:
>>> import os; print os.getenv("HTTP_PROXY"), os.getenv("http_proxy")
None None
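(urllib2 builds its proxy handling on top of urllib.getproxies(), which reads the same environment variables plus any platform-level proxy settings, so that is another check worth running. This is just the snippet I'd use, not output from the hung box:

import urllib
# An empty dict here would confirm that Python's HTTP stack sees no proxy at all.
print urllib.getproxies()
)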
So I'm still stumped. Sadly, the linked question is without a final answer as well.
(I'm also wondering if this question is related, but it seems unlikely that S3 would be failing to honor Connection: close headers.)
I have several Debian (Wheezy, x86_64) servers all exhibiting the following behavior:
All of the servers have a set of cron jobs that, among other things, pull data from S3. These usually run fine, but occasionally ps aux
reveals that some of the jobs started hours or days ago are still running, and have not finished cleanly.
Inspecting them with strace -p <pid>
shows that, in all cases, the process is hung on a read call. For example, the output for a process I checked just now was:
$ strace -p 12089
Process 12089 attached - interrupt to quit
read(5,
And checking the open file descriptors gives me this:
$ sudo lsof -i | grep 12089
python 12089 user 5u IPv4 809917771 0t0 TCP my.server.net:35427->185-201.amazon.com:https (ESTABLISHED)
At first I assumed this was simply due to the Python scripts not setting a read timeout, but that does not appear to be the case, for a few reasons:
- This doesn't happen when the same jobs are run on our OS X boxes (all 10.5, i386), using identical code.
- A variant of the script that does set a timeout (of 60 seconds, using socket.setdefaulttimeout -- this is in Python 2.7, but the code base has to be 2.5 compatible) has been hung since yesterday (a simplified sketch of what that variant does follows this list).
- Another process that isn't Python seems to occasionally exhibit similar behaviour. In this case, a Python script is executing an svn up --non-interactive process (using subprocess.Popen, for what that's worth).
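To be concrete, the timeout variant mentioned above does something along these lines. This is a simplified sketch, not the actual job code; the urllib2 call and the URL are placeholders:

import socket
import urllib2

# Applies to every socket created after this call; a stalled read should then
# raise socket.timeout instead of blocking forever. (urllib2.urlopen only grew
# its own timeout argument in 2.6, hence the global setting for 2.5 compatibility.)
socket.setdefaulttimeout(60)

try:
    # Placeholder URL -- the real jobs fetch objects from S3.
    response = urllib2.urlopen("https://example-bucket.s3.amazonaws.com/some-key")
    data = response.read()
except socket.timeout:
    # This is where a hung transfer is supposed to end up, which is why seeing
    # the process blocked in read()/recvfrom() is so surprising.
    data = None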
The situation with that SVN process is similar. Python is waiting for SVN:
$ strace -p 28034
Process 28034 attached - interrupt to quit
wait4(28127,
And SVN is waiting for a read call to complete:
$ strace -p 28127
Process 28127 attached - interrupt to quit
read(6,
And that read is on a socket connected to another external host:
$ sudo lsof -i | grep 28127
svn 28127 user 3u IPv4 701186417 0t0 TCP my.server.net:49299->sparrow.telecommunity.com:svn (ESTABLISHED)
svn 28127 user 6u IPv4 701186439 0t0 TCP my.server.net:49309->sparrow.telecommunity.com:svn (ESTABLISHED)
(There appears to be an svn:externals property set to ez_setup svn://svn.eby-sarna.com/svnroot/ez_setup on the directory being updated; based on their website, I think this redirects to telecommunity.com.)
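Since Popen in these Python versions has no timeout of its own, the only guard I can think of putting around the svn call is a watchdog loop along these lines (run_with_deadline is a hypothetical helper, not something currently in the code base):

import os
import signal
import subprocess
import time

def run_with_deadline(cmd, deadline_seconds):
    # Run cmd, but kill it if it is still running after deadline_seconds.
    # os.kill is used instead of Popen.kill() to stay 2.5-compatible.
    proc = subprocess.Popen(cmd)
    started = time.time()
    while proc.poll() is None:
        if time.time() - started > deadline_seconds:
            os.kill(proc.pid, signal.SIGKILL)  # SIGTERM first would be politer
            proc.wait()
            raise RuntimeError("%r exceeded %d seconds" % (cmd, deadline_seconds))
        time.sleep(5)
    return proc.returncode

# e.g. run_with_deadline(["svn", "up", "--non-interactive"], 30 * 60)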
Additional possibly-relevant points:
- The Python environment on the Macs is 2.5. On the Debian boxes, it is 2.7.
- I am not well versed in SVN, and I have no idea whether the reason it hangs is fundamentally the same or not. Nor am I entirely sure what the implications of svn:externals are; this was set up before my time.
- The Python scripts themselves are retrieving large-ish (~10 MB in some cases) chunks of data from Amazon S3, and this tends to be slow (I'm seeing download times as long as three minutes, which seems long compared to how long it takes the servers -- even in different data centers -- to communicate with each other). Similarly, some of our SVN repositories are rather large. So some of these operations are long-running anyway; however, they also seem to get hung for hours or days in some cases.
- On one server, the OOM killer took out MySQL this morning. On closer inspection, memory usage was at 90%, and swap usage was at 100% (as reported by Monit); killing a large backlog of hung Python jobs reduced these stats to 60% and 40% respectively. This gives me the impression that at least some (if not all) of the data is being downloaded/read (and held in memory while the process hangs).
- These cron jobs are requesting a list of resources from S3 and updating a list of MySQL tables accordingly. Each job is started with the same list, so every run tries to request the same resources and update the same tables.
- I was able to capture some traffic from one of the hung processes; it's all a bit inscrutable to me, but I wonder if it indicates that the connection is active and working, just very, very slow? I've provided it as a gist, to avoid clutter (it's about two hours' worth of capture): https://gist.github.com/petronius/286484766ad8de4fe20b (This turned out to be a red herring, I think: there is activity on that port, but it's not the same connection as the one to S3 -- it's just other random server activity.)
- I've tried to re-create this issue on a box in another data center (a VM running the same version of Debian with the same system setup) with no luck. I'd thought that perhaps the issue was related to this one, but the boxes experiencing these issues aren't VMs, and there are no dropped packets according to ifconfig. I guess this points to a network configuration issue, but I'm not sure where to start with that.
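For what it's worth, one workaround I've been considering (but haven't deployed) is enabling TCP keepalives on the sockets these jobs use, so that a connection whose peer has silently gone away is eventually torn down by the kernel instead of leaving read()/recvfrom() blocked forever. A minimal sketch, assuming I can get hold of the underlying socket object (the helper name and the interval values are made up):

import socket

def enable_keepalive(sock, idle=60, interval=30, count=5):
    # Start probing after `idle` seconds of silence, probe every `interval`
    # seconds, and give up (erroring out the blocked read) after `count`
    # failed probes.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* constants are Linux-specific (they aren't exposed on our
    # OS X 10.5 boxes, for instance), hence the hasattr guards.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)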
So I guess my questions are:
- Can I fix this at a system level, or is this something going wrong with each individual process?
- Is there something fundamentally different about how OS X and Linux handle read calls that I need to know in order to avoid processes that hang indefinitely?