
I'm new to parallel processing in Python. I have a piece of code below that walks through all directories and unzips all tar.gz files. However, it takes quite a bit of time.

import tarfile
import gzip
import os

def unziptar(path):
    """walk path and extract every tar.gz into the directory it sits in"""
    for root, dirs, files in os.walk(path):
        for i in files:
            fullpath = os.path.join(root, i)
            if i.endswith("tar.gz"):
                print 'extracting... {}'.format(fullpath)
                tar = tarfile.open(fullpath, 'r:gz')
                tar.extractall(root)
                tar.close()

path = 'C://path_to_folder'
unziptar(path)

print 'tar.gz extraction completed'

I have been looking through some posts on the multiprocessing and joblib packages, but I'm still not very clear on how to modify my script to run in parallel. Any help is appreciated.

EDIT: @tdelaney

Thanks for the help. The surprising thing is that the modified script took twice as long to unzip everything (60 minutes compared to 30 minutes with the original script)!

I looked at the Task Manager, and it appears that while multiple cores were utilised, the CPU usage was very low. I'm not sure why this is so.



1 Answer


It's pretty easy to create a pool to do the work. Just pull the extractor out into a separate worker.

import tarfile
import gzip
import os
import multiprocessing as mp

def unziptar(fullpath):
    """worker unzips one file"""
    print 'extracting... {}'.format(fullpath)
    tar = tarfile.open(fullpath, 'r:gz')
    tar.extractall(os.path.dirname(fullpath))
    tar.close()

def fanout_unziptar(path):
    """create pool to extract all"""
    my_files = []
    for root, dirs, files in os.walk(path):
        for i in files:
            if i.endswith("tar.gz"):
                my_files.append(os.path.join(root, i))

    pool = mp.Pool(min(mp.cpu_count(), len(my_files))) # number of workers
    pool.map(unziptar, my_files, chunksize=1)
    pool.close()
    pool.join()


if __name__=="__main__":
    path = 'C://path_to_folder'
    fanout_unziptar(path)
    print 'tar.gz extraction has completed'
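Since the question also mentions joblib, here is a rough sketch of the same fan-out using joblib's Parallel/delayed API. This is an assumption of how it would look (it requires joblib to be installed), not something I benchmarked:

import tarfile
import os
from joblib import Parallel, delayed

def unziptar(fullpath):
    """worker unzips one file"""
    tar = tarfile.open(fullpath, 'r:gz')
    tar.extractall(os.path.dirname(fullpath))
    tar.close()

def joblib_unziptar(path, n_jobs=4):
    """collect every tar.gz under path, then extract them in parallel"""
    my_files = []
    for root, dirs, files in os.walk(path):
        for name in files:
            if name.endswith("tar.gz"):
                my_files.append(os.path.join(root, name))
    # n_jobs is the number of worker processes; -1 would use every core
    Parallel(n_jobs=n_jobs)(delayed(unziptar)(f) for f in my_files)

if __name__ == "__main__":
    joblib_unziptar('C://path_to_folder')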
  • Oh, I think your example finally makes things clear! Let me try out the script. – Jake Apr 10 '17 at 02:40
  • Hmm, I get an error: "SyntaxError: invalid syntax" on the line "pool.map(unziptar, my_files, chunksize=1)". Do you know why? – Jake Apr 10 '17 at 02:51
  • Ah, of course, the closing bracket... Can I ask one more question? I'm running on Windows, and the script went crazy, repeatedly prompting me to add "if __name__ == '__main__': freeze_support()". The message is "attempt to start a new process before current process has finished its bootstrapping phase." Would you know how to implement this? Thanks for the help again. – Jake Apr 10 '17 at 03:10
  • Yeah, Windows is a bit different. I'm on my phone at the moment, but read the multiprocessing doc page and search for Windows (a sketch of the Windows-safe entry point follows these comments). – tdelaney Apr 10 '17 at 03:12
  • OK, I managed to start running it without errors. Fingers crossed that it processes as intended! – Jake Apr 10 '17 at 03:18
  • The script finished running, but it took much longer than the original. I have added the modified script I ran with to my post. Is there something I am not doing right? – Jake Apr 10 '17 at 04:58
  • I will set your answer as correct since it did run in parallel, and I understand how Pool works now. :) If you have the time to look at why it ran twice as slow, that would be great too. :) I read in other posts that changing it to Queue and Process can significantly speed things up, since each process takes time to initialize, but modifying my own script is beyond me at this point. – Jake Apr 10 '17 at 09:45
  • A Pool creates its processes once when it is created. You can have it discard and restart worker processes with `maxtasksperchild`, but I didn't use that parameter. It then uses Queues to pass work back and forth. Your problem is that unzipping tar files is significantly I/O bound. You may find a benefit from 2 or maybe a few more cores, but after that, you are just thrashing I/O and burning through system RAM, potentially causing the operating system to page stuff out onto that same hard drive. That may explain the problem (see the capped-worker sketch after these comments). – tdelaney Apr 10 '17 at 14:11
  • So, my code plus a few SSD drives, and it'll speed up! – tdelaney Apr 10 '17 at 14:14
  • Thanks for the tip! I just tried the script again on my 4-core Mac (without maxtasksperchild) and it ran about 1.5x faster! I think you have caught the bottleneck. :D Let me try again on my work desktop on Windows tomorrow and see if I can get similar results. Thanks a million! – Jake Apr 10 '17 at 15:24
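Following up on the Windows bootstrapping error and the I/O-bound point in the comments above, here is a minimal sketch of a Windows-safe entry point with the pool deliberately capped at two workers. The cap of two is an assumption drawn from the comment that extraction is disk-bound, not a measured optimum, and mp.freeze_support() is only strictly required for frozen executables but is harmless otherwise:

import tarfile
import os
import multiprocessing as mp

def unziptar(fullpath):
    """worker unzips one file"""
    tar = tarfile.open(fullpath, 'r:gz')
    tar.extractall(os.path.dirname(fullpath))
    tar.close()

def fanout_unziptar(path, workers=2):
    """collect the tar.gz files, then extract them with a small pool"""
    my_files = []
    for root, dirs, files in os.walk(path):
        for name in files:
            if name.endswith("tar.gz"):
                my_files.append(os.path.join(root, name))
    if not my_files:
        return
    # cap the pool: extraction is largely disk-bound, so extra processes
    # mostly compete for the same drive rather than adding speed
    pool = mp.Pool(min(workers, len(my_files)))
    pool.map(unziptar, my_files, chunksize=1)
    pool.close()
    pool.join()

if __name__ == "__main__":
    # On Windows, anything that starts new processes must run behind this
    # guard, because child processes re-import the module at startup.
    mp.freeze_support()
    fanout_unziptar('C://path_to_folder', workers=2)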