
I'm using Python boto and threading to download many files from S3 rapidly. I use this approach several times in my program and it works great. However, there is one step where it doesn't work. In that step, I try to download 3,000 files on a 32-core machine (an Amazon EC2 cc2.8xlarge).

The code below actually succeeds in downloading every file (except that sometimes there is an httplib.IncompleteRead error that the retries don't fix). However, only 10 or so of the 32 threads actually terminate, and the program just hangs. I'm not sure why: all the files have been downloaded, and all the threads should have exited. They do exit in the other steps, where I download fewer files. I've been reduced to downloading all these files with a single thread (which works, but is super slow). Any insights would be greatly appreciated!

from boto.ec2.connection import EC2Connection
from boto.s3.connection import S3Connection
from boto.s3.key import Key

from boto.exception import BotoClientError
from socket import error as socket_error
from httplib import IncompleteRead

import multiprocessing
from time import sleep
import os

import Queue
import threading

def download_to_dir(keys, dir):
    """
    Given a list of S3 keys and a local directory filepath,
    downloads the files corresponding to the keys to the local directory.
    Returns a list of filenames.
    """
    filenames = [None for k in keys]

    class DownloadThread(threading.Thread):

        def __init__(self, queue, dir):
            # call to the parent constructor
            threading.Thread.__init__(self)
            # create a connection to S3
            connection = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
            self.conn = connection
            self.dir = dir
            self.__queue = queue

        def run(self):
            while True:
                key_dict = self.__queue.get()
                print self, key_dict
                if key_dict is None:
                    print "DOWNLOAD THREAD FINISHED"
                    break
                elif key_dict == 'DONE': #last job for last worker
                    print "DOWNLOADING DONE"
                    break
                else: #still work to do!
                    index = key_dict.get('idx')
                    key = key_dict.get('key')
                    bucket_name = key.bucket.name
                    bucket = self.conn.get_bucket(bucket_name)
                    k = Key(bucket) #clone key to use new connection
                    k.key = key.key

                    filename = os.path.join(dir, k.key)
                    #make dirs if don't exist yet
                    try:
                        f_dirname = os.path.dirname(filename)
                        if not os.path.exists(f_dirname):
                            os.makedirs(f_dirname)
                    except OSError: #directory already created (possibly by another thread)
                        pass

                    #inspired by: http://code.google.com/p/s3funnel/source/browse/trunk/scripts/s3funnel?r=10
                    RETRIES = 5 #attempt at most 5 times
                    wait = 1
                    for i in xrange(RETRIES):
                        try:
                            k.get_contents_to_filename(filename)
                            break
                        except (IncompleteRead, socket_error, BotoClientError), e:
                            if i == RETRIES-1: #failed final attempt
                                raise Exception('FAILED TO DOWNLOAD %s, %s' % (k, e))
                            wait *= 2
                            sleep(wait)

                    #put filename in right spot!
                    filenames[index] = filename

    num_cores = multiprocessing.cpu_count()

    q = Queue.Queue(0)

    for i, k in enumerate(keys):
        q.put({'idx': i, 'key':k})
    for i in range(num_cores-1):
        q.put(None) # add end-of-queue markers
    q.put('DONE') #to signal absolute end of job

    #Spin up all the workers
    workers = [DownloadThread(q, dir) for i in range(num_cores)]
    for worker in workers:
        worker.start()

    #Block main thread until completion
    for worker in workers:
        worker.join() 

    return filenames
Max

2 Answers


Upgrade to AWS SDK version 1.4.4.0 or newer, or stick to exactly 2 threads. Older versions have a limit of at most 2 simultaneous connections. This means that your code will work well if you launch 2 threads; if you launch 3 or more, you are bound to see incomplete reads and exhausted timeouts.

You will see that while 2 threads can boost your throughput greatly, more than 2 does not change much because your network card is busy all the time anyway.
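
For what it's worth, here is a minimal sketch of the two-thread variant using the question's own DownloadThread, keys, dir and queue setup (the name MAX_DOWNLOAD_THREADS is just something I made up, and it uses one None sentinel per worker instead of the None/'DONE' split):

# inside download_to_dir, after the DownloadThread class definition
MAX_DOWNLOAD_THREADS = 2  # cap at 2, per the connection limit described above

num_workers = min(MAX_DOWNLOAD_THREADS, multiprocessing.cpu_count())

q = Queue.Queue(0)
for i, k in enumerate(keys):
    q.put({'idx': i, 'key': k})
for i in range(num_workers):
    q.put(None)  # one end-of-queue marker per worker

workers = [DownloadThread(q, dir) for i in range(num_workers)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()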

Jirka Hanika
  • Thanks @Jirka Hanika - changing to two threads seems to have resolved the issue. Though I think the Amazon machines are so beastly that actually having a large number of download threads does make things more efficient. I tried finding the AWS SDK >1.4.4, but the latest download on Amazon is 1.3.13... – Max Jul 12 '12 at 13:59
  • @Max - For most languages, 1.4 would be found [here](https://github.com/amazonwebservices/), and hopefully Python will appear there [as well one day](http://aws.typepad.com/aws/2012/01/big-news-regarding-python-boto-and-aws.html). Until then you might be out of luck, I am not sure. – Jirka Hanika Jul 12 '12 at 15:02

S3Connection uses httplib.py, and that library is not thread-safe, so ensuring each thread has its own connection is critical. It looks like you are doing that.

Boto already has its own retry mechanism, but you are layering one on top of that to handle certain other errors. I wonder if it would be advisable to create a new S3Connection object inside the except block. It just seems like the underlying HTTP connection could be in an unusual state at that point, and it might be best to start with a fresh connection.

Just a thought.
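
For illustration only, a sketch of that idea dropped into the retry loop from the question's run() method (same variable names as the question; whether a fresh connection actually clears the bad state is untested here):

# inside run(), replacing the existing retry loop
RETRIES = 5
wait = 1
for i in xrange(RETRIES):
    try:
        k.get_contents_to_filename(filename)
        break
    except (IncompleteRead, socket_error, BotoClientError), e:
        if i == RETRIES - 1:  # failed final attempt
            raise Exception('FAILED TO DOWNLOAD %s, %s' % (k, e))
        # throw away the possibly-wedged connection and rebuild the key
        # from a fresh S3Connection before retrying
        self.conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
        bucket = self.conn.get_bucket(bucket_name)
        k = Key(bucket)
        k.key = key.key
        wait *= 2
        sleep(wait)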

garnaat