
I'm scraping a gazillion audio files from government websites and I want to avoid getting duplicate files. With small files I've scraped in the past, I download the entire file, compute a SHA1 hash for it and compare that against the items already in my database.

Since the files I'm downloading now are much larger, I'd like to compute the SHA1 on just the first 500 KB of each file instead, so I can abort the download if it's something I already have.

I'm using the requests library to download the files. Is there a logical way to approach this that'll give me consistent results without forcing me to re-download those files over and over?
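One common approach (a sketch, not necessarily the only way) is to pass `stream=True` to `requests.get`, hash the body as it arrives in chunks, and stop reading once 500 KB have been hashed. The helper below takes any iterable of byte chunks, so it works with `resp.iter_content()`; the function name, chunk size, and the `already_seen` set in the usage comment are illustrative assumptions:

```python
import hashlib

PREFIX_BYTES = 500 * 1024  # hash only the first 500 KB


def prefix_sha1(chunks, limit=PREFIX_BYTES):
    """SHA1 of the first `limit` bytes drawn from an iterable of byte
    chunks. Stops consuming the iterator as soon as `limit` bytes have
    been hashed, so the rest of the download is never pulled in.

    If the stream ends before `limit` bytes, this is simply the hash of
    the whole (small) file, matching the old whole-file scheme."""
    sha1 = hashlib.sha1()
    seen = 0
    for chunk in chunks:
        take = chunk[: limit - seen]  # trim the chunk that crosses the limit
        sha1.update(take)
        seen += len(take)
        if seen >= limit:
            break
    return sha1.hexdigest()


# Hypothetical usage with requests (url and already_seen are assumed names):
#
# with requests.get(url, stream=True) as resp:
#     body = resp.iter_content(chunk_size=64 * 1024)
#     digest = prefix_sha1(body)
#     if digest in already_seen:
#         pass  # duplicate: leaving the `with` block closes the connection
#     else:
#         ...  # keep iterating `body` to save the rest of the file
```

Note that breaking out of the `for` loop does not exhaust the generator, so after a non-duplicate prefix you can keep iterating the same `iter_content()` generator to finish the download; closing the response early is what aborts the transfer on the wire.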

EDIT: I've been doing some research on this. One solution could be the HTTP Range header, but I've tested the 221 government websites I'll be scraping, and only 56 of them support it. So much for that idea.
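For anyone wanting to run the same survey, a quick way to probe Range support is to request a one-byte range and check whether the server answers `206 Partial Content` (servers that ignore the header reply `200` with the full body). A minimal sketch, with the function name and timeout as assumptions:

```python
import requests


def supports_range(url, timeout=10):
    """Heuristic check for HTTP Range support: ask for the first byte
    only and see whether the server honours it with 206 Partial Content."""
    try:
        resp = requests.get(
            url,
            headers={"Range": "bytes=0-0"},
            stream=True,  # don't download the body if the server sends 200
            timeout=timeout,
        )
        resp.close()
        return resp.status_code == 206
    except requests.RequestException:
        return False
```

Using `stream=True` matters here: a server that ignores the Range header will try to send the whole file, and streaming lets you close the connection without downloading it.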

mlissner
