I wrote a little Python program that looks through a directory (and its subdirectories) for files that contain non-ASCII characters.
I want to improve it. I know that certain files in this directory tree (ZIP, DTA/OUT, OMX, SFD/SF3, etc.) are SUPPOSED to contain non-ASCII characters. I want to recognize those files and screen them out, because my ultimate goal is to find files that should NOT contain non-ASCII characters but do, and remove them (I have a corrupt disk with bad sectors holding TBs of important data).
My thinking is to take a closer look at the files that land in the except branch of a try/except block like this one:
try:
    content.encode('ascii')  # raises if content holds any non-ASCII character
    output.write(str(counter) + ", " + file + ", ASCII\n")
    print(str(counter) + " ASCII file status logged successfully: " + file)
except UnicodeEncodeError:   # Python 3 raises UnicodeEncodeError here, not UnicodeDecodeError
    output.write(str(counter) + ", " + file + ", non-ASCII\n")
    print(str(counter) + " non-ASCII file status logged successfully: " + file)
counter += 1                 # incremented either way, so it can live outside the try/except
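For context, here is a minimal, self-contained sketch of the walk-and-check loop that snippet lives in. The function name, directory, and log path are placeholders of mine, not the real program; I read the files as bytes and try to decode as ASCII, which fails on any byte above 0x7F:

```python
import os

def scan_for_non_ascii(root_dir, log_path):
    """Walk root_dir, log each file as ASCII or non-ASCII, return non-ASCII paths."""
    flagged = []
    counter = 1
    with open(log_path, "w") as output:
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        # UnicodeDecodeError if any byte is outside the ASCII range
                        f.read().decode("ascii")
                    status = "ASCII"
                except UnicodeDecodeError:
                    status = "non-ASCII"
                    flagged.append(path)
                output.write("%d, %s, %s\n" % (counter, path, status))
                counter += 1
    return flagged
```

(The log file should live outside root_dir, or the scan will pick it up too.)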
When I started to write the code, I realized that looping through and asking whether each file is '.zip' or '.sfd' or '.omx', etc., would make for a clunky program and take forever.
Is there any way to search a group of file extensions other than one by one? Maybe a file containing these extensions to check against? Or something I haven't thought of? My apologies in advance if this is a stupid question, but there are so many cool functions in Python that I'm sure I'm missing something that can help.
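For example, I was imagining something like checking each file's extension against a set, possibly loaded from a text file of extensions (the extension list and file format here are just my illustration):

```python
import os

# Hypothetical whitelist: extensions of binary formats that are allowed
# to contain non-ASCII bytes. Adjust to the real file types.
BINARY_EXTENSIONS = {".zip", ".dta", ".out", ".omx", ".sfd", ".sf3"}

def is_expected_binary(filename, extensions=BINARY_EXTENSIONS):
    """True if the extension says this file may legitimately hold non-ASCII bytes."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in extensions

def load_extensions(path):
    """Read one extension per line (e.g. '.zip') from a plain-text file into a set."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}
```

Set membership is a single hashed lookup, so this stays fast no matter how many extensions end up on the list.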
Cheers.