
I'm downloading access logs from Amazon S3. There are a LOT of small files. To reduce the download time, I've decided to read each file in a thread.

This is my main method. It first connects to S3, then iterates over the documents and reads each document's content inside a separate thread.

def download_logs(self):
    """
    Downloads logs from S3 using Boto.
    """
    if self.aws_keys:
        conn = S3Connection(*self.aws_keys)
    else:
        conn = S3Connection()

    files = []
    mybucket = conn.get_bucket(self.input_bucket)
    with tempdir.TempDir() as directory:
        for item in mybucket.list(prefix=self.input_prefix):
            local_file = os.path.join(directory, item.key.split("/")[-1])
            logger.debug("Downloading %s to %s" % (item.key, local_file))
            thread = threading.Thread(target=item.get_contents_to_filename, args=(local_file,))
            thread.start()
            files.append((thread, local_file))

        elms = range(len(files))
        elemslen = len(elms)
        while elemslen:
            curr = random.choice(elms)
            thread, file = files[curr]
            if not thread.is_alive():
                yield file
                elms.remove(curr)
                elemslen -= 1

As you can see, this is a generator, since it yields. The generator is consumed by simply reading each downloaded file's content and concatenating the files:

        logs = self.download_logs()
        for downloaded in logs:
            self.concat_files(tempLog, downloaded)

The above code fails with the following warning raised in the threads:

[2014-10-20 15:15:21,427: WARNING/Worker-2] Exception in thread Thread-710:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Users/viktornagy/.virtualenvs/vidzor/lib/python2.7/site-packages/boto/s3/key.py", line 1561, in get_contents_to_filename
    fp = open(filename, 'wb')
IOError: [Errno 24] Too many open files: u'/var/folders/7h/9tt8cknn1qx40bs_s467hc3r0000gn/T/tmpZS9fdn/access_log-2014-10-20-11-36-20-9D6F43B122C83BD6'

Of course, I could raise the number of open files, but I would rather limit the number of threads to something meaningful.

Now my question is: how do I achieve that? I have a loop that generates a list of threads. Once that loop is finished, I digest the list and check for finished threads whose files can be yielded.

If I limit the number of threads in the first loop, then I'll never have the full list ready to start digesting it.

Akasha
  • What you need to do is refactor this into a consumer queue: queue up a bunch of 'files to be downloaded', then have a set of 'consumer threads' that pop items off this queue, download/process them, mark the file as processed, and move on to the next one. Python's [Queue](https://docs.python.org/2/library/queue.html) class is made for exactly this purpose (see the sketch after these comments). – aruisdante Oct 20 '14 at 14:06
  • Even though I accepted @dano's answer, I ended up with a different kind of refactoring, along the lines of http://stackoverflow.com/questions/11983938/python-appending-to-same-file-from-multiple-threds – Akasha Oct 20 '14 at 15:24
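
For reference, here is a minimal sketch of the producer/consumer pattern the first comment describes. It is not from the original thread: download_one stands for whatever per-item download callable you pass in (e.g. a small wrapper around key.get_contents_to_filename), and the worker count of 20 is an arbitrary assumption.

import Queue
import threading

def download_all(download_one, items, num_workers=20):
    """Call download_one(item) for every item using a fixed pool of worker threads."""
    tasks = Queue.Queue()
    results = []

    def worker():
        while True:
            item = tasks.get()
            try:
                # list.append is atomic under CPython's GIL, so collecting
                # results this way is safe here.
                results.append(download_one(item))
            finally:
                tasks.task_done()

    # Start a fixed number of consumer threads.
    for _ in range(num_workers):
        t = threading.Thread(target=worker)
        t.daemon = True   # let the program exit even if a worker blocks on get()
        t.start()

    # Produce: queue up every item to be downloaded.
    for item in items:
        tasks.put(item)

    tasks.join()          # block until every queued item has been processed
    return results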

1 Answer


You can use multiprocessing.dummy to create a pool of threading.Thread objects, and distribute the work to the threads in the Pool:

from multiprocessing.dummy import Pool

def download_logs(self):
    """
    Downloads logs from S3 using Boto.
    """
    if self.aws_keys:
        conn = S3Connection(*self.aws_keys)
    else:
        conn = S3Connection()

    mybucket = conn.get_bucket(self.input_bucket)
    pool = Pool(20)  # 20 threads in the pool. Tweak this as you see fit.
    with tempdir.TempDir() as directory:
        def download(item):
            # Download one key to a local file and return its path.
            local_file = os.path.join(directory, item.key.split("/")[-1])
            item.get_contents_to_filename(local_file)
            return local_file

        results = pool.imap_unordered(download,
                                      mybucket.list(prefix=self.input_prefix))
        for result in results:
            yield result

I'm using imap_unordered so that you can start yielding results as soon as they arrive, rather than needing to wait for all the tasks to complete.
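
As a toy illustration of that behaviour (assuming nothing beyond a trivial work function that just sleeps): the slowest task no longer holds up the earlier results.

from multiprocessing.dummy import Pool
import time

def work(seconds):
    time.sleep(seconds)
    return seconds

pool = Pool(3)
# Results come back in completion order (1, 2, 3), not in submission
# order (3, 1, 2); plain imap would preserve the submission order.
for result in pool.imap_unordered(work, [3, 1, 2]):
    print(result)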

dano