
I am learning how to crack zip files using dictionary attacks. This is the code:

import zipfile

def extractFile(zFile, password):
    try:
        # a wrong password makes extractall raise an exception
        zFile.extractall(pwd=password)
        print '[+] Found password ' + password + '\n'
    except Exception:
        pass

def main():
    zFile = zipfile.ZipFile('evil.zip')
    passFile = open('dictionary.txt')
    for line in passFile.readlines():
        password = line.strip('\n')
        extractFile(zFile, password)

if __name__ == '__main__':
    main()

Then I add threading:

import zipfile
from threading import Thread

def extractFile(zFile, password):
    try:
        zFile.extractall(pwd=password)
        print '[+] Found password ' + password + '\n'
    except Exception:
        pass

def main():
    zFile = zipfile.ZipFile('evil.zip')
    passFile = open('dictionary.txt')
    for line in passFile.readlines():
        password = line.strip('\n')
        t = Thread(target=extractFile, args=(zFile, password))
        t.start()

if __name__ == '__main__':
    main()

However, when I time the two programs, the first completes in about 90 seconds while the second takes nearly 300 seconds. The dictionary contains 459,026 entries. I am baffled as to why this happens. I also tried limiting the number of threads to 10, 20, and so on, but the plain loop is still faster every time. Can anybody explain why this is so? Also, is there any way to improve the program?

EDIT: I tried slicing as Ray suggested:

import sys
import zipfile
from threading import Thread

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

def extractFile(zFile, passwords):
    for password in passwords:
        try:
            zFile.extractall(pwd=password)
            print '[+] Found password ' + password + '\n'
            # SystemExit must not be swallowed by the except below;
            # note it only ends this worker thread, not the whole program
            sys.exit(0)
        except Exception:
            continue

def main():
    zFile = zipfile.ZipFile('evil.zip')
    with open('dictionary.txt', 'rb') as pass_file:
        passwords = [i.strip() for i in pass_file]
    passes = list(chunks(passwords, 10))
    for pas in passes:
        t = Thread(target=extractFile, args=(zFile, pas))
        t.start()

if __name__ == '__main__':
    main()

It still takes 3-4 minutes.


2 Answers


One reason this does not work properly: with multiprocessing you must open the zip file in each subprocess, otherwise you can be hurt by sharing file handles. Then create only a handful of subprocesses (say, 2 × the number of cores), and let each subprocess test multiple passwords.

Thus we get:

import zipfile
from multiprocessing import Process


def extract_file(passwords):
    # open the zip file inside the subprocess; don't share the handle
    with zipfile.ZipFile('evil.zip') as zipf:
        for password in passwords:
            try:
                zipf.extractall(pwd=password)
                print('[+] Found password {}\n'.format(password))
                return  # stop once the password is found
            except Exception:
                pass


def main():
    with open('dictionary.txt', 'rb') as pass_file:
        passwords = [i.strip() for i in pass_file]

    N_PROC = 8
    procs = []
    for i in range(N_PROC):
        # stride slicing gives process i every N_PROC-th password,
        # spreading the work evenly across the processes
        p = Process(target=extract_file, args=[passwords[i::N_PROC]])
        p.start()
        procs.append(p)
    for p in procs:
        p.join()


if __name__ == '__main__':
    main()

Can anybody explain why this is so?

I think that, in addition to the problem of the Global Interpreter Lock (GIL), you might be using the threads incorrectly.

Judging from the loop, you're starting a completely new thread for every password line in your file, i.e. just to make a single attempt. Starting a new thread for only a single attempt is, as you've discovered, expensive and does not work out as you expected. If you did this with multiprocessing, it would be even slower, because creating a completely new process for a single try is even more expensive than creating a thread.
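To see that per-task overhead for yourself, here is a small illustrative micro-benchmark (hypothetical, not from the original programs; the exact numbers will vary by machine):

import time
from threading import Thread

def attempt():
    pass  # stand-in for a single password attempt

# plain loop: 10000 direct calls
start = time.time()
for _ in range(10000):
    attempt()
print('plain loop: {:.2f}s'.format(time.time() - start))

# one thread per call: pays thread creation/startup cost every time
start = time.time()
for _ in range(10000):
    t = Thread(target=attempt)
    t.start()
    t.join()
print('one thread per call: {:.2f}s'.format(time.time() - start))

The second loop does the same work but pays thread setup and teardown for every call, which is exactly the cost the per-password threading above runs into.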

Is there any chance to improve the program at all?

I suggest you:

  • break up the passwords into several sub-lists/groups (i.e. slicing)
  • create a thread (or process) for each of these sub-lists
  • let each thread/process consume a group (i.e. make multiple attempts and get more out of them)

For example, if you have 100 lines in the file, you could break it up into 4 parts (i.e. 25 passwords per sub-list) and use these to feed 4 threads/processes (i.e. one for each sub-list).

Using multiprocessing here would be advantageous because you can avoid the GIL. However, keep in mind that you'd still have multiple processes accessing the same file simultaneously, so make sure you account for this when trying to extract the file, etc.

You should take care not to overwhelm your PC's cores. You might want to use a process pool (see the Python docs; a sketch follows below) and cap the number of processes you create at the number of cores in your PC (perhaps your_core_count - 1 to keep the machine responsive).

Then, as each process consumes a sub-list and terminates, a new process is created (or an existing one is re-assigned, if using a process pool) to handle yet another sub-list waiting in your queue. If one of the children completes successfully, you might want the parent process to kill all the other children to avoid unnecessary resource usage.
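Here is a minimal sketch of that pool approach, assuming the same evil.zip and dictionary.txt as above (the pool size and chunking are just starting points to measure and tune):

import zipfile
from multiprocessing import Pool, cpu_count

def try_chunk(passwords):
    # open the zip inside the worker process; don't share handles
    with zipfile.ZipFile('evil.zip') as zipf:
        for password in passwords:
            try:
                zipf.extractall(pwd=password)
                return password  # report the hit back to the parent
            except Exception:
                continue
    return None

def main():
    with open('dictionary.txt', 'rb') as f:
        passwords = [line.strip() for line in f]

    n_proc = max(cpu_count() - 1, 1)        # leave one core free
    size = -(-len(passwords) // n_proc)     # ceiling division: one chunk per worker
    chunks = [passwords[i:i + size] for i in range(0, len(passwords), size)]

    with Pool(n_proc) as pool:
        # results arrive as chunks finish; leaving the with-block
        # terminates any workers that are still running
        for found in pool.imap_unordered(try_chunk, chunks):
            if found is not None:
                print('[+] Found password {}'.format(found))
                break

if __name__ == '__main__':
    main()

Breaking out of the loop makes the with-block terminate the remaining workers, which gives you the "parent kills the other children" behaviour described above.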

  • I tried slicing as shown in the edit in the main question. Is this what you suggested? – Echchama Nayak Feb 13 '16 at 09:40
  • 1
    @Ekoji: Something along those lines, but not quite. I still think you should use *processes*, not threads, to get *parallelism* instead of concurrency, and that feeding only 10 passwords for each one is a waste. Use larger values. With your 400K+ entry dictionary, in a 4-core system, I'd start with ~100K entries per core (play around w/ values and measure). You must distribute the work *evenly* and make sure you give each process enough work to do. Until then, it'll probably be a waste. That's why your code is still slower than Antti's rewrite, based on my original set of suggestions. – code_dredd Feb 13 '16 at 10:57