
I have a number of large comma-delimited text files (the biggest is about 15GB) that I need to process using a Python script. The problem is that the files sporadically contain DOS EOF (Ctrl-Z) characters in the middle of them. (Don't ask me why, I didn't generate them.) The other problem is that the files are on a Windows machine.

On Windows, when my script encounters one of these characters, it assumes it is at the end of the file and stops processing. For various reasons, I am not allowed to copy the files to any other machine. But I still need to process them.

Here are my ideas so far:

  1. Read the file in binary mode, throwing out bytes that equal chr(26). This would work, but it would take approximately forever.
  2. Use something like sed to eliminate the EOF characters. Unfortunately, as far as I can tell, sed on Windows has the same problem and will quit when it sees the EOF.
  3. Use some kind of Notepad program and do a find-and-replace. But it turns out that Notepad-type programs don't cope well with 15GB files.

My IDEAL solution would be some way to just read the file as text and simply ignore the Ctrl-Z characters. Is there a reasonable way to accomplish this?

Joel
  • Have you looked at running `sed` under a pseudo-Unix environment like Cygwin? It was built for exactly this purpose, and I've got to imagine there's a way around EOF characters... – MattDMo Dec 20 '13 at 02:38
  • It's not Python that treats Ctrl+Z as EOF in text files: that's deep in the bowels of the Windows file system. It's impossible on Windows, in any programming language, to open a file in text mode and *not* have Ctrl+Z treated as end-of-file. – Tim Peters Dec 20 '13 at 03:00
  • @TimPeters, I don't think that's true - I dare you to find a binary/text flag in the Windows API [`CreateFile`](http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx). It's just that the conventions are so pervasive that it's hard to bypass them. – Mark Ransom Dec 20 '13 at 03:46
  • @MarkRansom, you could well be right! I never use the Windows API directly except when writing Windows-specific code for the Python implementation, so know as little about it as possible ;-) – Tim Peters Dec 20 '13 at 03:54

1 Answer

It's easy to use Python to delete the DOS EOF chars; for example,

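import sys

def delete_eof(fin, fout):
    """Copy fin to fout in chunks, dropping every DOS EOF (Ctrl-Z) byte."""
    BUFSIZE = 2**15
    # A bytes literal works on both Python 2 and 3; chr(26) is a text
    # string on Python 3 and cannot be passed to bytes.translate().
    EOFCHAR = b"\x1a"
    data = fin.read(BUFSIZE)
    while data:
        # translate(None, EOFCHAR) deletes the listed bytes and keeps the rest
        fout.write(data.translate(None, EOFCHAR))
        data = fin.read(BUFSIZE)

ipath = sys.argv[1]
opath = ipath + ".new"
# Binary mode matters: it stops Ctrl-Z from being treated as end-of-file
with open(ipath, "rb") as fin, open(opath, "wb") as fout:
    delete_eof(fin, fout)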

The script takes a file path as its first argument and copies the file, minus every Ctrl-Z byte (chr(26)), to the same path with .new appended. It reads in 32 KiB chunks, so memory use stays constant no matter how large the file is. Fiddle to taste.
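If writing a second 15GB copy is unwelcome, the same chunked idea can feed a parser directly. What follows is a minimal Python 3 sketch, not part of the original answer: `iter_clean_lines` is a hypothetical helper, and it assumes the data is in a single-byte encoding or UTF-8, so that dropping byte 26 and splitting on newline bytes cannot damage any character.

```python
import csv
import io

EOF_BYTE = b"\x1a"  # DOS EOF (Ctrl-Z)

def iter_clean_lines(fin, bufsize=2**20, encoding="latin-1"):
    """Yield text lines from a binary file object, with Ctrl-Z bytes removed.

    The default encoding is an assumption; pass the real one if known.
    """
    pending = b""  # tail of the previous chunk that has no newline yet
    while True:
        chunk = fin.read(bufsize)
        if not chunk:
            break
        # Strip Ctrl-Z first, then prepend the unfinished line from last time
        chunk = pending + chunk.translate(None, EOF_BYTE)
        lines = chunk.split(b"\n")
        pending = lines.pop()  # last piece may be an incomplete line
        for line in lines:
            yield line.decode(encoding) + "\n"
    if pending:
        # File did not end with a newline; emit what is left
        yield pending.decode(encoding)

# Small in-memory demonstration; a tiny bufsize forces chunk boundaries
sample = io.BytesIO(b"a,b\x1a,c\r\n1,\x1a2,3\r\n")
for row in csv.reader(iter_clean_lines(sample, bufsize=4)):
    print(row)
```

For a real file, open it with `open(path, "rb")` and pass the file object in place of `sample`. The file must still be opened in binary mode; only the cleaned lines are decoded to text.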

By the way, are you sure that DOS EOF characters are your only problem? It's hard to conceive of a sane way in which they could end up in files intended to be treated as text files.
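One way to check is to count every control byte in the file before deciding what to strip. This is a rough Python 3 sketch under stated assumptions: `count_control_bytes` is a hypothetical helper, and treating tab, LF, and CR as the only legitimate control bytes is a guess about what a comma-delimited text file should contain.

```python
import io

def count_control_bytes(fin, bufsize=2**20):
    """Return {byte value: count} for control bytes other than tab, LF, CR."""
    # Assumption: bytes 9, 10, and 13 are expected in a text file
    suspects = [bytes([b]) for b in range(32) if b not in (9, 10, 13)]
    counts = {}
    while True:
        chunk = fin.read(bufsize)
        if not chunk:
            break
        for s in suspects:
            n = chunk.count(s)  # bytes.count runs in C, so this stays fast
            if n:
                counts[s[0]] = counts.get(s[0], 0) + n
    return counts

# Small in-memory demonstration
sample = io.BytesIO(b"a\x1ab\x00\x1a\r\n\t")
print(count_control_bytes(sample))
```

If the result contains anything besides 26, stripping Ctrl-Z alone will not leave a clean text file.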

Tim Peters
  • At this point in the project I am wary of assuming any sort of *intent* on the part of the people who provided the files. :P This definitely isn't my only problem, but it definitely is my *biggest* problem. – Joel Dec 20 '13 at 05:04