6

This is the strangest thing!

I have a multi-threaded client application written in Python. I'm using threading to concurrently download and process pages. I would use the cURL multi-handle except that the bottleneck is definitely the processor (not the bandwidth) in this application so it is more efficient to use a thread pool.

I have a 64b i7 rocking 16GB RAM. Beefy. I launch 80 threads while listening to Pandora and trolling Stackoverflow and BAM! The parent process sometimes ends with the message

Killed

Other times a single page (which is it's own process in Chrome) will die. Other times the whole browser crashes.

If you want to see a bit of code here is the gist of it:

Here is the parent process:

def start( ):
  while True:
    for url in to_download:
      queue.put( ( url, uri_id ) )

    to_download = [ ]

    if queue.qsize( ) < BATCH_SIZE:
      to_download = get_more_urls( BATCH_SIZE )

    if threading.activeCount( ) < NUM_THREADS:
      for thread in threads:
        if not thread.isAlive( ):
          print "Respawning..."
          thread.join( )
          threads.remove( thread )
          t = ClientThread( queue )
          t.start( )
          threads.append( t )

    time.sleep( 0.5 )

And here is the gist of the ClientThread:

class ClientThread( threading.Thread ):

  def __init__( self, queue ):
    threading.Thread.__init__( self )
    self.queue = queue

  def run( self ):
    while True:
      try:
        self.url, self.url_id = self.queue.get( )
      except:
        raise SystemExit

      html = StringIO.StringIO( )
      curl = pycurl.Curl( )
      curl.setopt( pycurl.URL, self.url )
      curl.setopt( pycurl.NOSIGNAL, True )
      curl.setopt( pycurl.WRITEFUNCTION, html.write )
      curl.close( )

      try:
        curl.perform( )
      except pycurl.error, error:
        errno, errstr = error
        print errstr

      curl.close( )

EDIT: Oh, right...forgot to ask the question...should be obvious: Why do my processes get killed? Is it happening at the OS level? Kernel level? Is this due to a limitation on the number of open TCP connections I can have? Is it a limit on the number of threads I can run at once? The output of cat /proc/sys/kernel/threads-max is 257841. So...I don't think it's that....

I think I've got it...OK...I have no swap space at all on my drive. Is there a way to create some swap space now? I'm running Fedora 16. There WAS swap...then I enabled all my RAM and it disappeared magically. Tailing /var/log/messages I found this error:

Mar 26 19:54:03 gazelle kernel: [700140.851877] [15961]   500 15961    12455     7292   1       0             0 postgres
Mar 26 19:54:03 gazelle kernel: [700140.851880] Out of memory: Kill process 15258 (chrome) score 5 or sacrifice child
Mar 26 19:54:03 gazelle kernel: [700140.851883] Killed process 15258 (chrome) total-vm:214744kB, anon-rss:70660kB, file-rss:18956kB
Mar 26 19:54:05 gazelle dbus: [system] Activating service name='org.fedoraproject.Setroubleshootd' (using servicehelper)
KeatsKelleher
  • 10,015
  • 4
  • 45
  • 52

2 Answers2

8

You've triggered the kernel's Out Of Memory (OOM) handler; it selects which processes to kill in a complicated fashion that tries hard to kill as few processes as possible to make the most impact. Chrome apparently makes the most inviting process to kill under the criteria the kernel uses.

You can see a summary of the criteria in the proc(5) manpage under the /proc/[pid]/oom_score file:

   /proc/[pid]/oom_score (since Linux 2.6.11)
          This file displays the current score that the kernel
          gives to this process for the purpose of selecting a
          process for the OOM-killer.  A higher score means that
          the process is more likely to be selected by the OOM-
          killer.  The basis for this score is the amount of
          memory used by the process, with increases (+) or
          decreases (-) for factors including:

          * whether the process creates a lot of children using
            fork(2) (+);

          * whether the process has been running a long time, or
            has used a lot of CPU time (-);

          * whether the process has a low nice value (i.e., > 0)
            (+);

          * whether the process is privileged (-); and

          * whether the process is making direct hardware access
            (-).

          The oom_score also reflects the bit-shift adjustment
          specified by the oom_adj setting for the process.

You can adjust the oom_score file for your Python program if you want it to be the one that is killed.

Probably the better approach is adding more swap to your system to try to push off the time when the OOM-killer is invoked. Granted, having more swap doesn't necessarily mean that your system will never run out of memory -- and you might not care for the way it handles if there is a lot of swap traffic -- but it can at least get you past tight memory problems.

If you've already allocated all the space available for swap partitions, you can add swap files. Because they go through the filesystem, there is more overhead for swap files than swap partitions, but you can add them after the drive is partitioned, making it an easy short-term solution. You use the dd(1) command to allocate the file (do not use seek to make a sparse file) and then use mkswap(8) to format the file for swap use, then use swapon(8) to turn on that specific file. (I think you can even add swap files to fstab(5) to make them automatically available at next reboot, too, but I've never tried and don't know the syntax.)

sarnold
  • 102,305
  • 22
  • 181
  • 238
  • Note that Chrome renderer processes volunteer for killing by writing hardcoded `300` to `/proc/self/oom_score_adj` regardless of your actual RAM. With modern systems, the Chrome process will be overcharged for many gigabytes of RAM while deciding the process to kill to free RAM. If you want your process to be killed before Chrome, you have to use value like `500`. For details, see https://bugs.chromium.org/p/chromium/issues/detail?id=333617 – Mikko Rantalainen Jul 20 '22 at 16:13
0

You are doing a

raise SystemExit

which actually exits the Python interpreter and not the thread you are running in.

Wessie
  • 3,460
  • 2
  • 13
  • 17