I wrote a little Python program that looks through a directory (and its subdirectories) for files that contain non-ASCII characters.

I want to improve it. I know that certain files in this "directory" may be ZIP, DTA/OUT, OMX, SFD/SF3, etc. files that ARE SUPPOSED to have non-ASCII characters. So I want to identify those and screen them out from the files that shouldn't contain non-ASCII characters, because my ultimate goal is to find files that should not contain non-ASCII characters but do, and remove them (the disk is corrupt, with bad sectors, and holds terabytes of important data).

My thinking is to further look through the files that are in the "except" portion of a try/except block in Python that looks like this:

try:
    content.decode('ascii')  # raises UnicodeDecodeError if any byte is >= 128
    output.write(str(counter) + ", " + file + ", ASCII\n")
    print str(counter) + " ASCII file status logged successfully: " + file
    counter += 1 

except UnicodeDecodeError:
    output.write(str(counter) + ", " + file + ", non-ASCII\n")
    print str(counter) + " non-ASCII file status logged successfully: " + file
    counter += 1 

When I started to write the code, I realized that looping through and asking whether the file is '.zip' or '.sfd' or '.omx', etc. would make for a clunky program and take forever.

Is there any way to search a group of file extensions other than one by one? Maybe a file containing these extensions to check against? Or something I haven't thought of? My apologies in advance if this is a stupid question, but there are so many cool functions in Python that I'm sure I'm missing something that can help.

Cheers.

nicorellius
  • I think there is a better solution than a simple exclude list, but just so you know, doing it that way wouldn't be slow: you are only doing a simple regex or string comparison. – brc Nov 13 '11 at 04:27
  • It may help your state of mind to condition yourself on the correct terminology. On nearly all modern systems, files contain bytes, not characters. So you are looking for byte values 128 or greater. These are "non ASCII" bytes. If you also want to exclude controls other than newline, tab, etc. then you will look for certain byte values less than 32, and for 127. – wberry Nov 14 '11 at 17:34
  • Thanks for the lesson on terminology... I will try to think that way and maybe it will help my overall outlook on these types of problems. – nicorellius Nov 18 '11 at 18:24

1 Answer

I figure that since there aren't any answers, I can go ahead and answer this myself with a partial answer. I basically took a different approach: I looked for a particular file that is expected to be abundant on this share, and then I will do the same for each file. It's kind of hacky, but it will get the job done.
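
For the extension check the question asks about, one option is to keep the extensions in a set and test membership once per file instead of comparing them one by one. Below is a minimal sketch of that idea, assuming Python 3 (the snippet in the question is Python 2). The names `BINARY_EXTENSIONS`, `is_expected_binary`, `has_non_ascii_bytes`, and `classify` are hypothetical, and the extension list is only a guess based on the file types mentioned in the question. Following the comment about terminology, the check looks for byte values of 128 or greater rather than decoding text.

```python
import os

# Hypothetical list: extensions whose files are expected to hold non-ASCII bytes.
# Adjust it to match the file types actually present on the share.
BINARY_EXTENSIONS = {".zip", ".dta", ".out", ".omx", ".sfd", ".sf3"}


def is_expected_binary(path):
    """Return True if the file's extension is in the expected-binary set."""
    ext = os.path.splitext(path)[1].lower()
    return ext in BINARY_EXTENSIONS  # one set lookup, not a chain of comparisons


def has_non_ascii_bytes(path, chunk_size=1024 * 1024):
    """Return True if any byte in the file has a value of 128 or greater."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                # End of file reached without finding a high byte.
                return False
            if max(chunk) >= 128:
                return True


def classify(root):
    """Yield (path, status) for every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if is_expected_binary(path):
                status = "binary expected"
            elif has_non_ascii_bytes(path):
                status = "non-ASCII"
            else:
                status = "ASCII"
            yield path, status
```

With this layout, the files reported as `"non-ASCII"` are the ones worth a closer look, since their extension does not explain the high bytes. The set could also be loaded from a plain text file with one extension per line, if the list needs to change without editing the script.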

nicorellius