
I need to calculate the CRC32, MD5 and SHA1 of the content of zip files without decompressing them.

So far I have found out how to calculate these for the zip file itself, e.g.:

CRC32:

import zlib


zip_name = "test.zip"


def Crc32Hasher(file_path):

    buf_size = 65536
    crc32 = 0

    with open(file_path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            crc32 = zlib.crc32(data, crc32)

    return format(crc32 & 0xFFFFFFFF, '08x')


print(Crc32Hasher(zip_name))

SHA1: (MD5 similarly)

import hashlib


zip_name = "test.zip"


def Sha1Hasher(file_path):

    buf_size = 65536
    sha1 = hashlib.sha1()

    with open(file_path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            sha1.update(data)

    return sha1.hexdigest()


print(Sha1Hasher(zip_name))

For the content of the zip file, I can read the CRC32 directly from the zip, without having to calculate it, as follows:

Read CRC32 of zip content:

import zipfile

zip_name = "test.zip"

if zip_name.lower().endswith('.zip'):
    # Open the archive only when the name matches, and close it when done
    with zipfile.ZipFile(zip_name, "r") as z:
        for info in z.infolist():
            print(info.filename,
                  format(info.CRC & 0xFFFFFFFF, '08x'))

But I couldn't figure out how to calculate the SHA1 (or MD5) of the content of zip files without decompressing them first. Is that somehow possible?

paradadf

1 Answer


It is not possible. The CRC is available only because it was precalculated and stored in the archive when the archive was created (it is used for integrity checks). Any other checksum or hash has to be calculated from scratch, which requires at least streaming the archive's content through a decompressor, i.e. unpacking it. The unpacking can be done in chunks in memory, so nothing has to be written to disk.
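To illustrate why the CRC comes for free, here is a minimal sketch. It assumes a throwaway in-memory archive built only for the demonstration (the member name `hello.txt` and the payload are made up) and checks that the CRC stored in the zip directory equals `zlib.crc32` of the uncompressed bytes.

```python
import io
import zipfile
import zlib

payload = b"hello zip " * 1000  # made-up sample data

# Build a small archive in memory (hypothetical member name).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("hello.txt", payload)

# Read the stored CRC from the archive directory; no member data is decompressed here.
with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("hello.txt")
    stored_crc = info.CRC

# The stored value is the CRC-32 of the *uncompressed* content.
computed_crc = zlib.crc32(payload) & 0xFFFFFFFF
print(format(stored_crc, "08x"), stored_crc == computed_crc)
```

No comparable field exists for MD5 or SHA1, which is why those have to be computed from the decompressed stream.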

UPD: Possible implementations

libarchive: extra dependencies, supports many archive formats

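import hashlib
import libarchive.public as libarchive

fname = "test.zip"  # archive path, as in the question
with libarchive.file_reader(fname) as archive:
    for entry in archive:
        md5 = hashlib.md5()
        # Feed the decompressed blocks of this entry to the hash
        for block in entry.get_blocks():
            md5.update(block)
        print(str(entry), md5.hexdigest())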

Native zipfile: no dependencies, zip only

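import hashlib
import zipfile

fname = "test.zip"  # archive path, as in the question
blocksize = 1024**2  # 1 MiB chunks
with zipfile.ZipFile(fname) as archive:
    for name in archive.namelist():
        md5 = hashlib.md5()
        # archive.open() decompresses the entry on the fly; nothing is written to disk
        with archive.open(name) as entry:
            while True:
                block = entry.read(blocksize)
                if not block:
                    break
                md5.update(block)
        print(name, md5.hexdigest())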
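Since the question asks for CRC32, MD5 and SHA1 together, the native `zipfile` approach could be extended to compute all three in a single pass over each entry. This is only a sketch: the function name `hash_zip_members` and the shape of the returned dictionary are assumptions made for illustration, not part of any library.

```python
import hashlib
import zipfile
import zlib


def hash_zip_members(zip_path, blocksize=1024 * 1024):
    """Return {member name: {"crc32", "md5", "sha1"}} for every file in the archive.

    Hypothetical helper: each entry is decompressed in chunks in memory,
    and every chunk is fed to all three checksums at once.
    """
    results = {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            crc = 0
            md5 = hashlib.md5()
            sha1 = hashlib.sha1()
            with archive.open(info) as entry:
                # iter() with a sentinel stops at the first empty read
                for block in iter(lambda: entry.read(blocksize), b""):
                    crc = zlib.crc32(block, crc)
                    md5.update(block)
                    sha1.update(block)
            results[info.filename] = {
                "crc32": format(crc & 0xFFFFFFFF, "08x"),
                "md5": md5.hexdigest(),
                "sha1": sha1.hexdigest(),
            }
    return results
```

Calling `hash_zip_members("test.zip")` would then give one dictionary of hex digests per member. The computed `crc32` can also be compared with `info.CRC` as a sanity check.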
Steve Barnes
Marat