I'm trying to run a dictionary attack on a sha512 hash. I know the hash is made up of two words, all lowercase, separated by a space. The words are from a known dictionary (02-dictionary.txt), which contains 172,820 words. Currently, my code is the following:

import hashlib
import sys
import time

def crack_hash(word, target):
    dict_hash = hashlib.sha512(word.encode())
    if dict_hash.hexdigest() == target:
        return (True, word)
    else:
        return (False, None)

if __name__ == "__main__":
    target_hash = sys.argv[1].strip()
    
    fp = open("02-dictionary.txt", "r")

    words = []
    start_time = time.time()
    for word in fp:
        words.append(word)
    fp.close()

    for word1 in words:
        for word2 in words:
            big_word = word1.strip() + " " + word2.strip()
            print(big_word)
            soln_found, soln_word = crack_hash(big_word.strip(), target_hash)
            if soln_found:
                print('Solution found')
                print("The word was:", soln_word)
                break

    end_time = time.time()
    total_time = end_time - start_time
    print("Time taken:", round(total_time, 5), "seconds")

However, when I run this code, the program runs incredibly slowly. I know Python isn't the most efficient language, but I'm guessing the issue stems more from the data structure choice. Is there a more efficient data structure? I tried to use the array module, but the documentation makes it seem as though it is designed for more primitive types (ints, floats, shorts, bools, chars, etc.), and not for more complicated types like strings (or lists of chars). What is the best way of improving this code? In about an hour of runtime, I only got through around 1% of all the possible word combinations.

  • For starters, you don't need to compute `word1.strip()` in each iteration of the inner loop; that value doesn't change. But if there are 173,000**2 possible words, you need to check half of them on average to find the solution. There's not much you can do (beyond constant-factor speed-ups via parallelization) to speed that up. – chepner Sep 18 '21 at 16:34
  • This is why keys that require brute-force approaches to breaking are considered secure. – chepner Sep 18 '21 at 16:35
  • Remove those print statements, since they add to the runtime of the program, and modularise the code into a function – mightyandweakcoder Sep 18 '21 at 16:51
  • you also might want to consider preprocessing ```big_word``` into a text file, then reading it directly from the text file – mightyandweakcoder Sep 18 '21 at 17:02
  • also, your break statement only breaks out of the inner loop; modify that to break out of both loops – mightyandweakcoder Sep 18 '21 at 17:04
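
A minimal sketch folding the comments' suggestions into the question's code: words are stripped once when the file is read, the word1 prefix is built once per outer iteration, the per-pair print is gone, and returning from a function exits both loops at once (the function names here are illustrative, not from the question):

import hashlib
import sys

def crack_hash(word, target):
    return hashlib.sha512(word.encode()).hexdigest() == target

def search(words, target_hash):
    for word1 in words:
        prefix = word1 + " "              # built once per outer iteration
        for word2 in words:
            big_word = prefix + word2
            if crack_hash(big_word, target_hash):
                return big_word           # returning exits both loops at once
    return None

if __name__ == "__main__":
    target_hash = sys.argv[1].strip()
    with open("02-dictionary.txt") as fp:
        words = [line.strip() for line in fp]   # strip each word once, up front
    print(search(words, target_hash))

This still issues the same quadratic number of hashlib calls, so it remains slow; the answer below attacks that cost directly.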

1 Answer


The problem is that you're computing 178,000² = 31,684,000,000 (about 2³⁵) hashes. That's a lot of work. I've made a few changes to achieve some optimizations in pure Python, but I suspect that the overhead of the hashlib calls is pretty significant. I think doing this all in native code would result in a more dramatic speedup.

The optimizations include the following:

  • Precomputing the words in the dictionary into bytes objects
  • Precomputing the partial hash result of the first part of the hash

import hashlib
import sys
import time


def try_all(words, target_hash):
    for word1 in words:
        # Hash "word1 " (word plus the separating space) once per outer iteration.
        hash_prefix = hashlib.sha512(word1 + b' ')
        for word2 in words:
            # copy() clones the partial SHA-512 state, so only word2 is hashed here.
            prefix_copy = hash_prefix.copy()
            prefix_copy.update(word2)
            # print(big_word)
            if prefix_copy.digest() == target_hash:
                print('Solution found')
                big_word = (word1 + b' ' + word2).decode('utf8')
                print(f'The word was: {big_word}')
                return


def read_all_words(filename):
    with open(filename, "rt") as f:
        return [line.strip().encode('utf-8') for line in f]


def get_test_hash(words):
    phrase = words[-2] + b' ' + words[-1]  # pick target towards end
    return hashlib.sha512(phrase).digest()


if __name__ == "__main__":
    words = read_all_words("02-dictionary.txt")
    TESTING = True
    if TESTING:
        words = words[:5000]  # reduce the size of the word list for testing only
        target_hash = get_test_hash(words)
    else:
        target_hash = bytes.fromhex(sys.argv[1].strip())
    start_time = time.time()
    try_all(words, target_hash)
    end_time = time.time()
    total_time = end_time - start_time
    print(f"Time taken: {round(total_time, 5)} seconds")
    print(f'{total_time / pow(len(words), 2)} seconds per hash')
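
With TESTING left as True, the script ignores the command-line argument and times a reduced 5,000-word run against a known target; set it to False to attack a real hash, passing the SHA-512 digest as a hex string on the command line (for example python3 crack.py <sha512-hex-digest>, where the script name is an assumption here).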

On my laptop this runs at about 1.1 × 10⁻⁶ seconds per hash, so trying all 178,000² word pairs in the dictionary would take a little under 10 hours of CPU time.
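
As chepner's comment notes, parallelization only buys a constant factor, but it is an easy one to collect. A rough sketch (not part of the answer above) of splitting the outer loop across processes with multiprocessing, reusing the same partial-hash copy() trick; the chunking scheme, function names, and worker count are illustrative:

import hashlib
from multiprocessing import Pool

def search_chunk(args):
    # Each worker gets a slice of the first words, plus the full list and target digest.
    chunk, words, target_hash = args
    for word1 in chunk:
        prefix = hashlib.sha512(word1 + b' ')
        for word2 in words:
            h = prefix.copy()
            h.update(word2)
            if h.digest() == target_hash:
                return (word1 + b' ' + word2).decode('utf8')
    return None

def try_all_parallel(words, target_hash, processes=8):
    # One interleaved slice of first words per process.
    tasks = [(words[i::processes], words, target_hash) for i in range(processes)]
    with Pool(processes) as pool:
        for result in pool.imap_unordered(search_chunk, tasks):
            if result is not None:
                return result   # the with-block terminates the pool on exit
    return None

On platforms that use the spawn start method (Windows, macOS), the Pool must be created under the if __name__ == "__main__": guard, just as in the script above.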

– President James K. Polk