Why does '\x01\x1A' (Start-of-Header and Substitute control characters) in a textfile line stop a for-loop prematurely?

Question

I'm using Python 2.7.15, Windows 7

Context

I wrote a script to read and tokenize each line of a FileZilla log file (specifications here) for the IP address of the host that initiated the connection to the FileZilla server. I'm having trouble parsing the log text field that follows the > character. The script I wrote uses the:

    with open('fz.log','r') as rh:
        for lineno, line in enumerate(rh):
            pass

construct to read each line. That for-loop stopped prematurely when it encountered a log text field that contained the SOH and SUB characters. I can't show you the log file since it contains sensitive information but the crux of the problem can be reproduced by reading a textfile that contains those characters on a line.

My goal is to extract the IP addresses (which I can do using re.search()) but before that happens, I have to remove those control characters. I do this by creating a copy of the log file where the lines containing those control characters are removed. There's probably a better way, but I'm more curious why the for-loop just stops after encountering the control characters.

Reproducing the Issue

I reproduced the problem with this code:

if __name__ == '__main__':
    fn = 'writetest.txt'
    fn2 = 'writetest_NoControlChars.txt'

    # Create the problematic textfile
    with open(fn, 'w') as wh:
        wh.write("This line comes first!\n")
        wh.write("Blah\x01\x1A\n")  # Write the Start-of-Header and Substitute control characters to the line
        wh.write("This comes after!")

    # Try to read the file above, removing the SOH/SUB characters if encountered
    with open(fn, 'r') as rh:
        with open(fn2, 'w') as wh:
            for lineno, line in enumerate(rh):
                sline = line.translate(None,'\x01\x1A')
                wh.write(sline)
                print "Line #{}: {}".format(lineno, sline)
    print "Program executed."

Output

The code above creates 2 output files and produces the following in a console window:

Line #0: This line comes first!

Line #1: Blah
Program executed.

I step-debugged through the code in Eclipse and immediately after executing the

for lineno, line in enumerate(rh):

statement, rh (the handle for the opened file) was closed. I had expected the loop to move on to the third line, printing This comes after! to the console and writing it to writetest_NoControlChars.txt, but neither happened. Instead, execution jumped to print "Program executed.". [Screenshot: local variable values in the debug console]

You have to open this file in binary mode if you know it contains non-text data: `open(fn, 'rb')` — mvp, Nov 02 '18 at 04:07
Changing `with open(fn, 'r') as rh:` to `with open(fn, 'rb') as rh:` like you suggested worked. Thanks (I'd be happy to accept your answer if you post it as such)! — Minh T., Nov 02 '18 at 04:43
I dug a little deeper and the documentation for [open()](https://docs.python.org/3/library/functions.html#open) says opening a file in binary mode returns contents of the file as bytes without decoding. By default, `open()` operates in `'Text I/O'` mode, which is why it didn't work. I looked at the API for the [io module](https://docs.python.org/3/library/io.html#io.TextIOBase) but couldn't find anything that would explain why the file descriptor would be closed in text mode when those characters are encountered. Did I miss something? How can I dig deeper for an explanation? — Minh T., Nov 02 '18 at 04:46
Which operating system and Python version are you using? On my Mac with Python 2.7.11, your code above reports 'This comes after!' at the appropriate time. — Matthias Fripp, Nov 02 '18 at 08:30
On DOS and Windows systems, `0x1A` (`Ctrl-Z`) is treated as an end-of-file marker by text mode I/O. — Jongware, Nov 02 '18 at 11:40
.. however, I can't find any mention of this in the source code of `_iobase` and `_textiobase`. So ultimately, this *may* piggy-back off the C standard library, which also does the same for Windows. — Jongware, Nov 02 '18 at 12:14
@usr2564301 That explains why if I wrote and read just `\x01` and not `\x01\x1A`, the program runs fine! — Minh T., Nov 03 '18 at 03:30
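The Ctrl-Z explanation from the comments above can be checked by scanning a file in binary mode, where `0x1A` is just another byte and never ends the read. A minimal sketch; the helper name `lines_containing_sub` and the sample data are made up for illustration, and the code is written so that it should run on both Python 2.7 and Python 3:

```python
import os
import tempfile

SUB = b'\x1a'  # Ctrl-Z, the Substitute control character


def lines_containing_sub(path):
    """Return the 1-based numbers of the lines that contain a 0x1A byte."""
    hits = []
    # Binary mode: the bytes are returned as stored, with no Ctrl-Z handling.
    with open(path, 'rb') as fh:
        for lineno, line in enumerate(fh, 1):
            if SUB in line:
                hits.append(lineno)
    return hits


# Build a small sample file shaped like the one in the question.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as fh:
    fh.write(b"This line comes first!\nBlah\x01\x1a\nThis comes after!")

sub_lines = lines_containing_sub(sample_path)
os.remove(sample_path)
print(sub_lines)
```

Any line number reported here is a line at which a text mode read on Windows would be expected to stop.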

score 2 · Accepted Answer · answered Nov 02 '18 at 07:33

You have to open this file in binary mode if you know it contains non-text data: open(fn, 'rb')
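As a sketch of how that change could be applied to the reproduction script from the question: both files are opened in binary mode, and the two control bytes are deleted from each line. The function names and the use of a temporary directory are illustrative assumptions, not part of the original script. `translate(None, ...)` with a bytes argument is used because it should behave the same on Python 2.7 `str` and Python 3 `bytes`.

```python
import os
import shutil
import tempfile

# SOH (0x01) and SUB (0x1A), the two bytes from the question.
CONTROL_BYTES = b'\x01\x1a'


def copy_without_control_bytes(src_path, dst_path, delete=CONTROL_BYTES):
    """Copy src_path to dst_path in binary mode, dropping the bytes in `delete`.

    Returns the number of lines read.
    """
    lines_read = 0
    with open(src_path, 'rb') as rh:
        with open(dst_path, 'wb') as wh:
            for line in rh:
                # With table=None, translate() only deletes the listed bytes.
                wh.write(line.translate(None, delete))
                lines_read += 1
    return lines_read


def demo():
    """Recreate the question's test file, clean it, and return (line count, cleaned bytes)."""
    tmpdir = tempfile.mkdtemp()
    try:
        src = os.path.join(tmpdir, 'writetest.txt')
        dst = os.path.join(tmpdir, 'writetest_NoControlChars.txt')
        with open(src, 'wb') as wh:
            wh.write(b"This line comes first!\n")
            wh.write(b"Blah\x01\x1a\n")
            wh.write(b"This comes after!")
        count = copy_without_control_bytes(src, dst)
        with open(dst, 'rb') as rh:
            return count, rh.read()
    finally:
        shutil.rmtree(tmpdir)
```

Calling `demo()` should report three lines read, with the third line present in the cleaned copy. Note that in binary mode on Windows, line endings are not translated, so a file written in text mode will still carry `\r\n` at the end of each line.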

mvp

What decides what is "non-text data"? Do you have a reference for that in the official documentation? – Jongware Nov 03 '18 at 15:15
I guess any byte stream that cannot be assigned a known text encoding (utf-8, latin1, windows-1251, etc.) should be considered binary. Opening a file in binary mode turns off any automatic parsing/processing that happens for text files. – mvp Nov 03 '18 at 19:03
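Because binary mode leaves the bytes untouched, the IP address search mentioned in the question could probably run directly on the raw lines, without first writing a cleaned copy of the log. A rough sketch under stated assumptions: the addresses are IPv4, the pattern is deliberately simple, and the sample log lines are hypothetical, since the real log was not shown.

```python
import re

# Simple IPv4-looking pattern on bytes; it does not check that each octet is <= 255.
IPV4_RE = re.compile(br'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def extract_ips(lines):
    """Yield the first IPv4-looking match from each bytes line that has one."""
    for line in lines:
        match = IPV4_RE.search(line)
        if match:
            yield match.group(0)


# Hypothetical log lines; the second one carries the SOH/SUB bytes.
sample_lines = [
    b"(000001) 11/2/2018 4:07:00 AM - (not logged in) (192.0.2.10)> Connected\n",
    b"(000001) 11/2/2018 4:07:01 AM - (not logged in) (198.51.100.7)> Blah\x01\x1a\n",
    b"a line without any address\n",
]
found = list(extract_ips(sample_lines))
print(found)
```

A file opened with `open('fz.log', 'rb')` can be passed to `extract_ips` in place of the list, since iterating over it yields bytes lines in the same way.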
