I'm downloading access logs from Amazon S3. These are A LOT of small files, so to reduce the total download time I've decided to read each file in its own thread.
This is my main method: it first connects to S3, then iterates over the documents and reads each document's content inside a separate thread.
    def download_logs(self):
        """
        Downloads logs from S3 using Boto.
        """
        if self.aws_keys:
            conn = S3Connection(*self.aws_keys)
        else:
            conn = S3Connection()
        files = []
        mybucket = conn.get_bucket(self.input_bucket)
        with tempdir.TempDir() as directory:
            for item in mybucket.list(prefix=self.input_prefix):
                local_file = os.path.join(directory, item.key.split("/")[-1])
                logger.debug("Downloading %s to %s" % (item.key, local_file))
                thread = threading.Thread(target=item.get_contents_to_filename,
                                          args=(local_file,))
                thread.start()
                files.append((thread, local_file))
            elms = range(len(files))
            elemslen = len(elms)
            while elemslen:
                curr = random.choice(elms)
                thread, file = files[curr]
                if not thread.is_alive():
                    yield file
                    elms.remove(curr)
                    elemslen -= 1
As you can see, this is a generator, since it yields. The generator is consumed by simply reading each file's content and concatenating the files:
    logs = self.download_logs()
    for downloaded in logs:
        self.concat_files(tempLog, downloaded)
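For context, `concat_files` just appends the downloaded file onto `tempLog`; a minimal stand-in (not my exact helper, just a sketch of what it does) would be:

```python
def concat_files(dest_path, src_path):
    # Append src onto dest in binary mode, chunk by chunk,
    # so large log files never have to fit in memory at once.
    with open(dest_path, "ab") as dest:
        with open(src_path, "rb") as src:
            for chunk in iter(lambda: src.read(64 * 1024), b""):
                dest.write(chunk)
```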
The above code fails with the following warning raised in the threads:
    [2014-10-20 15:15:21,427: WARNING/Worker-2] Exception in thread Thread-710:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
        self.run()
      File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 763, in run
        self.__target(*self.__args, **self.__kwargs)
      File "/Users/viktornagy/.virtualenvs/vidzor/lib/python2.7/site-packages/boto/s3/key.py", line 1561, in get_contents_to_filename
        fp = open(filename, 'wb')
    IOError: [Errno 24] Too many open files: u'/var/folders/7h/9tt8cknn1qx40bs_s467hc3r0000gn/T/tmpZS9fdn/access_log-2014-10-20-11-36-20-9D6F43B122C83BD6'
Of course, I could raise the limit on open files, but I would rather cap the number of threads at something meaningful.
Now my question is how to achieve that. I have a loop that generates the list of threads; once that loop is finished, I digest the list and check for finished threads whose files can be yielded.
If I limit the number of threads in the first loop, then I'll never have the full list ready to start digesting it.
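The shape I have in mind is roughly this (a sketch only: `bounded_downloads` and the `(func, arg)` task list are made up for illustration, with a `threading.BoundedSemaphore` gating how many download threads run at once):

```python
import threading

def bounded_downloads(tasks, max_threads=4):
    """Run each (func, arg) task in its own thread, but allow at most
    max_threads of them to be alive at once."""
    sem = threading.BoundedSemaphore(max_threads)
    threads = []

    def worker(func, arg):
        try:
            func(arg)
        finally:
            sem.release()  # free a slot as soon as this task finishes

    for func, arg in tasks:
        sem.acquire()  # blocks while max_threads workers are running
        t = threading.Thread(target=worker, args=(func, arg))
        t.start()
        threads.append((t, arg))

    for t, arg in threads:
        t.join()
        yield arg  # hand back each argument once its thread is done

# Stand-in usage: results.append plays the role of
# item.get_contents_to_filename, and the ints play the file names.
results = []
tasks = [(results.append, i) for i in range(10)]
done = list(bounded_downloads(tasks, max_threads=3))
```

The spawning loop still builds the full list of threads, it just blocks whenever `max_threads` downloads are in flight, so at most that many files are open at any moment.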