I'm using zipfile to create an archive of all files in a directory (recursively, while preserving directory structure including empty folders) and want the process to skip the filenames specified in a list.

This is the basic function, which walks through a directory with os.walk and adds all the files and directories it contains to an archive.

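import os
import zipfile

def zip_dir(path):
    # Name the archive after the last component of the path
    zipname = str(path.rsplit('/')[-1]) + '.zip'
    with zipfile.ZipFile(zipname, 'w', zipfile.ZIP_DEFLATED) as zf:
        if os.path.isdir(path):
            for root, dirs, files in os.walk(path):
                for file_or_dir in files + dirs:
                    # Store each entry relative to the parent of path
                    zf.write(os.path.join(root, file_or_dir),
                             os.path.relpath(os.path.join(root, file_or_dir),
                                             os.path.join(path, os.path.pardir)))
        elif os.path.isfile(path):
            # Single file: store it under its base name
            zf.write(path, os.path.basename(path))
        # List the contents; the with block closes the archive on exit
        zf.printdir()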

We can see the code should also have the ability to handle single files but it is mainly the part concerning directories that we are interested in.

Now let's say we have a list of filenames that we want to exclude from being added to the zip archive.

skiplist = ['.DS_Store', 'tempfile.tmp']

What is the best and cleanest way to achieve this?

I tried using the built-in zip() function, which was somewhat successful, but it causes empty folders to be excluded (they should be included). I'm not sure why this happens.

skiplist = ['.DS_Store', 'tempfile.tmp']
for root, dirs, files in os.walk(path):
    for (file_or_dir, skipname) in zip(files + dirs, skiplist):
        if skipname not in file_or_dir:
            zf.write(os.path.join(root, file_or_dir),
                    os.path.relpath(os.path.join(root, file_or_dir),
                    os.path.join(path, os.path.pardir)))

It would also be interesting to see a clean way to skip specific file extensions, perhaps with something like .endswith('.png'), but I'm not sure how to combine that with the existing skiplist.

I would also appreciate any other general comments regarding the function and if it indeed works as expected without surprises, as well as any suggestions for optimizations or improvements.

Dharman
noob

1 Answer

You can simply check if the file is not in skiplist:

skiplist = {'.DS_Store', 'tempfile.tmp'}

for root, dirs, files in os.walk(path):
    for file in files + dirs:
        if file not in skiplist:
            zf.write(os.path.join(root, file),
                     os.path.relpath(os.path.join(root, file),
                     os.path.join(path, os.path.pardir)))

This will ensure that files in skiplist won't be added to the archive.
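
As for why the zip() attempt in the question loses entries: zip() pairs its arguments position by position and stops at the shorter one. With a two-item skiplist, at most two entries per directory are ever examined, and each is compared against only one skip name. Directories come last in files + dirs, so they are usually the ones cut off. A small illustration, using made-up entry names in place of a real os.walk listing:

```python
skiplist = ['.DS_Store', 'tempfile.tmp']
# Hypothetical directory listing, standing in for files + dirs from os.walk
entries = ['a.txt', '.DS_Store', 'b.txt', 'empty_dir']

# zip() pairs items positionally and stops at the shorter input
pairs = list(zip(entries, skiplist))
print(pairs)  # [('a.txt', '.DS_Store'), ('.DS_Store', 'tempfile.tmp')]

# A membership test checks every entry against every skip name
kept = [name for name in entries if name not in skiplist]
print(kept)  # ['a.txt', 'b.txt', 'empty_dir']
```

Note that in the zip() version '.DS_Store' is only compared with 'tempfile.tmp', so it would still be written, while 'b.txt' and 'empty_dir' are never reached.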

Another optimization is to make skiplist a set. If it ever gets very large, a set gives average-case constant time O(1) membership checks, whereas a list needs a linear O(N) scan.

You can read more about this on the TimeComplexity page of the Python wiki, which lists the time complexities of various Python operations on data structures.
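
To get a feel for the difference, here is a rough timing sketch. The 100,000 generated names are an assumption made purely so the effect is measurable; the actual numbers will vary by machine, so no output is shown.

```python
import timeit

# Hypothetical large skip collection, built only for the timing comparison
names = [f'file_{i}.tmp' for i in range(100_000)]
skip_as_list = names
skip_as_set = set(names)

target = 'file_99999.tmp'  # last element: worst case for the list scan

# Both containers give the same answer to the membership question
assert (target in skip_as_list) == (target in skip_as_set)

list_time = timeit.timeit(lambda: target in skip_as_list, number=100)
set_time = timeit.timeit(lambda: target in skip_as_set, number=100)
print(f'list: {list_time:.4f}s  set: {set_time:.6f}s')
```

For a skiplist of two names, as in the question, the difference is negligible; the set mainly pays off as the collection grows.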

As for extensions, you can use os.path.splitext() to extract the extension and use the same logic as above:

from os.path import splitext

extensions = {'.png', '.txt'}

for root, dirs, files in os.walk(path):
    for file in files:
        _, extension = splitext(file)
        if extension not in extensions:
            zf.write(os.path.join(root, file),
                     os.path.relpath(os.path.join(root, file),
                     os.path.join(path, os.path.pardir)))
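
It may help to see what os.path.splitext() returns for a few edge cases before relying on it. The sketch below also shows one possible way to make the check case-insensitive; has_skipped_extension is a hypothetical helper name, not part of the standard library.

```python
from os.path import splitext

print(splitext('photo.png'))       # ('photo', '.png')
print(splitext('archive.tar.gz'))  # ('archive.tar', '.gz')
print(splitext('.DS_Store'))       # ('.DS_Store', '')
print(splitext('IMAGE.PNG'))       # ('IMAGE', '.PNG')

extensions = {'.png', '.txt'}

def has_skipped_extension(filename):
    """Hypothetical helper: case-insensitive extension check."""
    return splitext(filename)[1].lower() in extensions
```

Two things follow from this: only the last suffix counts as the extension, and a dotfile such as .DS_Store has no extension at all, so it has to be caught by the name-based skiplist rather than the extension set.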

If you want to combine the above features, then you can handle the logic for files and directories separately:

from os.path import splitext

extensions = {'.png', '.txt'}
skiplist = {'.DS_Store', 'tempfile.tmp'}

for root, dirs, files in os.walk(path):
    for file in files:
        _, extension = splitext(file)
        if file not in skiplist and extension not in extensions:
            zf.write(os.path.join(root, file),
                     os.path.relpath(os.path.join(root, file),
                     os.path.join(path, os.path.pardir)))

    for directory in dirs:
        if directory not in skiplist:
            zf.write(os.path.join(root, directory),
                     os.path.relpath(os.path.join(root, directory),
                     os.path.join(path, os.path.pardir))) 

Note: The above code snippets won't work by themselves; you will need to weave them into your current code to use these ideas.
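
One possible way to weave these ideas into a complete function is sketched below. The name zip_dir_filtered and its parameters are hypothetical, and the sketch makes one assumption beyond the snippets above: a directory named in skiplist should be left out together with everything inside it. With the plain loops above, os.walk still descends into such a directory and its files are still added, so the sketch prunes dirs in place.

```python
import os
import zipfile
from os.path import splitext

def zip_dir_filtered(path, skiplist=frozenset(), extensions=frozenset()):
    """Hypothetical helper: zip `path`, skipping given names and extensions."""
    path = os.path.normpath(path)               # drop any trailing separator
    zipname = os.path.basename(path) + '.zip'   # created in the current directory
    parent = os.path.join(path, os.path.pardir) # archive names are relative to this
    with zipfile.ZipFile(zipname, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(path):
            # Prune in place so os.walk does not descend into skipped directories
            dirs[:] = [d for d in dirs if d not in skiplist]
            for name in dirs:
                # Writing the directory itself keeps empty folders in the archive
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, parent))
            for name in files:
                if name in skiplist or splitext(name)[1] in extensions:
                    continue
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, parent))
    return zipname
```

A call might look like zip_dir_filtered('/path/to/project', {'.DS_Store', 'tempfile.tmp'}, {'.png'}), where the path is a placeholder. The archive is written to the current working directory, so run it from somewhere outside the directory being zipped.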

RoadRunner
  • @noob No worries, glad I could help. – RoadRunner Dec 22 '18 at 05:59
  • I'm however not sure I fully understand what you mean by "Another optimization is to make skiplist a list, just in case it gets very large, and you want constant time O(1) lookup instead of linear O(N) lookup from using a list.", could you expand on that please? – noob Dec 22 '18 at 06:01
  • 1
    @noob That was a type on my part, I've edited the answer. Do you know Big O Notation time complexity? It basically means that when you use a `set`, you can hash directly to the item you want, which is O(1). If you use a list, the underlying code has to iterate through the whole list to check if it exists. I've added a link in the answer to where these time complexities are documented. It might be worth studying Big O notation before looking at this though. Basically, anything that is O(1) is more efficient than O(N). – RoadRunner Dec 22 '18 at 06:07
  • Thanks for your reply. Just for clarification, that's a set you used in your Answer, right? Basically curly brackets { } instead of [ ] to take advantage of constant time O(1) lookup, if I've understood correctly. I'm not sure why to ever use lists [ ] then but perhaps that's a whole 'nother question. – noob Dec 22 '18 at 06:20
  • 2
    @noob Yes the curly brackets is a set. You can also use `set()` to define it. Sets are *unordered*, so if you need to maintain order, then using a list is the better option, since lists are *ordered*. For this question, since you only need to do lookups to check files/extensions to skip, then using a set is the better option. But for other scenarios, that might not be the case. It really depends on what problem your solving. – RoadRunner Dec 22 '18 at 06:22