
I'm downloading access logs from Amazon S3. There are a LOT of small files. To reduce the download time, I've decided to read each file in a thread.

This is my main method. It first connects to S3, then iterates over the documents and reads each document's content inside a separate thread.

def download_logs(self):
    """
    Downloads logs from S3 using Boto.
    """
    if self.aws_keys:
        conn = S3Connection(*self.aws_keys)
    else:
        conn = S3Connection()

    files = []
    mybucket = conn.get_bucket(self.input_bucket)
    with tempdir.TempDir() as directory:
        for item in mybucket.list(prefix=self.input_prefix):
            local_file = os.path.join(directory, item.key.split("/")[-1])
            logger.debug("Downloading %s to %s" % (item.key, local_file))
            thread = threading.Thread(target=item.get_contents_to_filename, args=(local_file,))
            thread.start()
            files.append((thread, local_file))

        elms = range(len(files))
        elemslen = len(elms)
        while elemslen:
            curr = random.choice(elms)
            thread, file = files[curr]
            if not thread.is_alive():
                yield file
                elms.remove(curr)
                elemslen -= 1

As you can see, this is a generator, since it yields. The generator is consumed by simply reading each downloaded file's content and concatenating the files:

        logs = self.download_logs()
        for downloaded in logs:
            self.concat_files(tempLog, downloaded)

The above code fails with the following warning raised in the threads:

[2014-10-20 15:15:21,427: WARNING/Worker-2] Exception in thread Thread-710:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Users/viktornagy/.virtualenvs/vidzor/lib/python2.7/site-packages/boto/s3/key.py", line 1561, in get_contents_to_filename
    fp = open(filename, 'wb')
IOError: [Errno 24] Too many open files: u'/var/folders/7h/9tt8cknn1qx40bs_s467hc3r0000gn/T/tmpZS9fdn/access_log-2014-10-20-11-36-20-9D6F43B122C83BD6'

Of course, I could raise the number of open files, but I would rather limit the number of threads to something meaningful.

Now my question is: how do I achieve that? I have a loop that generates a list of threads. Once that loop is finished, I digest the list and check for finished threads whose files can be yielded.

If I limit the number of threads in the first loop, then I'll never have the full list ready to start digesting it.

Akasha
  • What you need to do is refactor this into a consumer queue: queue up a bunch of 'files to be downloaded', then have a set of 'consumer threads' that pop items off this queue, download/process them, mark the file as processed, and move on to the next one. Python's [Queue](https://docs.python.org/2/library/queue.html) class is made for exactly this purpose (see the sketch after these comments). – aruisdante Oct 20 '14 at 14:06
  • Even though I accepted @dano's answer, I ended up with a different kind of refactoring, along the lines of http://stackoverflow.com/questions/11983938/python-appending-to-same-file-from-multiple-threds – Akasha Oct 20 '14 at 15:24
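
For reference, here is a minimal sketch of the producer/consumer pattern the first comment describes. It is not from the original thread: download_one stands for whatever per-item download callable you pass in (e.g. a small wrapper around key.get_contents_to_filename), and the worker count of 20 is an arbitrary assumption.

import Queue
import threading

def download_all(download_one, items, num_workers=20):
    """Call download_one(item) for every item using a fixed pool of worker threads."""
    tasks = Queue.Queue()
    results = []

    def worker():
        while True:
            item = tasks.get()
            try:
                # list.append is atomic under CPython's GIL, so collecting
                # results this way is safe here.
                results.append(download_one(item))
            finally:
                tasks.task_done()

    # Start a fixed number of consumer threads.
    for _ in range(num_workers):
        t = threading.Thread(target=worker)
        t.daemon = True   # let the program exit even if a worker blocks on get()
        t.start()

    # Produce: queue up every item to be downloaded.
    for item in items:
        tasks.put(item)

    tasks.join()          # block until every queued item has been processed
    return results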

1 Answer


You can use multiprocessing.dummy to create a pool of threading.Thread objects, and distribute the work to the threads in the Pool:

from multiprocessing.dummy import Pool

def download_logs(self):
    """
    Downloads logs from S3 using Boto.
    """
    if self.aws_keys:
        conn = S3Connection(*self.aws_keys)
    else:
        conn = S3Connection()

    mybucket = conn.get_bucket(self.input_bucket)
    pool = Pool(20)  # 20 threads in the pool. Tweak this as you see fit.
    with tempdir.TempDir() as directory:
        def download(item):
            # Download one key to a local file and return its path.
            local_file = os.path.join(directory, item.key.split("/")[-1])
            item.get_contents_to_filename(local_file)
            return local_file

        results = pool.imap_unordered(download,
                                      mybucket.list(prefix=self.input_prefix))
        for result in results:
            yield result

I'm using imap_unordered so that you can start yielding results as soon as they arrive, rather than needing to wait for all the tasks to complete.
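
As a toy illustration of that behaviour (assuming nothing beyond a trivial work function that just sleeps): the slowest task no longer holds up the earlier results.

from multiprocessing.dummy import Pool
import time

def work(seconds):
    time.sleep(seconds)
    return seconds

pool = Pool(3)
# Results come back in completion order (1, 2, 3), not in submission
# order (3, 1, 2); plain imap would preserve the submission order.
for result in pool.imap_unordered(work, [3, 1, 2]):
    print(result)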

dano