I have a number of large comma-delimited text files (the biggest is about 15GB) that I need to process using a Python script. The problem is that the files sporadically contain DOS EOF (Ctrl-Z) characters in the middle of them. (Don't ask me why, I didn't generate them.) The other problem is that the files are on a Windows machine.
On Windows, when my script encounters one of these characters, it assumes it is at the end of the file and stops processing. For various reasons, I am not allowed to copy the files to any other machine. But I still need to process them.
Here are my ideas so far:
- Read the file in binary mode, throwing out bytes that equal `chr(26)`. This would work, but it would take approximately forever.
- Use something like `sed` to eliminate the EOF characters. Unfortunately, as far as I can tell, `sed` on Windows has the same problem and will quit when it sees the EOF.
- Use some kind of `Notepad` program and do a find-and-replace. But it turns out that `Notepad`-type programs don't cope well with 15GB files.
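For reference, here is roughly what the binary-mode idea might look like if done in large chunks rather than one byte at a time. This is only a minimal sketch that I have not timed on the 15GB file; the function names `iter_clean_chunks` and `strip_ctrl_z` and the 1 MB chunk size are illustrative assumptions, not anything from an existing library.

```python
CTRL_Z = b"\x1a"  # the DOS EOF byte, i.e. chr(26)


def iter_clean_chunks(path, chunk_size=1024 * 1024):
    """Yield the file's contents in chunks with every Ctrl-Z byte removed.

    The name and the default chunk size are hypothetical choices.
    """
    # Binary mode, so the file is read as raw bytes all the way to its real end.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Ctrl-Z is a single byte, so it can never straddle a chunk
            # boundary; a per-chunk replace is enough.
            yield chunk.replace(CTRL_Z, b"")


def strip_ctrl_z(src_path, dst_path, chunk_size=1024 * 1024):
    """Write a cleaned copy of src_path to dst_path on the same machine."""
    with open(dst_path, "wb") as out:
        for chunk in iter_clean_chunks(src_path, chunk_size):
            out.write(chunk)
```

Calling something like `strip_ctrl_z("big.csv", "big_clean.csv")` (both file names made up here) would leave a cleaned copy next to the original, which I could then process as ordinary text. It still means an extra pass over every file, though, which is what I was hoping to avoid.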
My IDEAL solution would be some way to just read the file as text and simply ignore the Ctrl-Z characters. Is there a reasonable way to accomplish this?