Inconsistent file behavior

Question

I'm trying to track down a Python UnicodeDecodeError in the following log line:

10.210.141.123 - - [09/Nov/2011:14:41:04 -0800] "gfR\x15¢\x09ì|Äbk\x0F[×ÐÖà\x11CEÐÌy\x5C¿DÌj\x08Ï ®At\x07å!;f>\x08éPW¤\x1C\x02ö*6+\x5C\x15{,ªIkCRA\x22 xþP9â\x13h\x01¢è´\x1DzõWiË\x5C\x10sòÊ¨R)¶²\x1F8äl¾¢{ÆNw\x08÷@ï" 400 166 0.000 "-" "-"

I opened the entire log file in Vim, and then yanked the line into a new file so I could test just the one line. However, my parsing script works OK with the new file - it doesn't throw a UnicodeDecodeError. I don't understand why the one file would generate an error and the other one would not, when they are (on the surface) identical.

Here's what I tried: I ran enca to determine the file encoding, but it complained "Cannot determine (or understand) your language preferences." file -i reports both files as regular files. I also deleted every other line in the original log file and still got the error in one file and no error in the other. I tried deleting

set encoding=utf-8

from my .vimrc, writing the file again, and I still got the error in one file and not in the other.

The logs are nginx logs. Nginx has this note in their release notes:

*) Change: now the 0x00-0x1F, '"' and '\' characters are escaped as \xXX
   in an access_log.
   Thanks to Maxim Dounin.

My Python script has with open('log_file') as f and the error comes up when I try to call json.dumps on a dict.

How can I track this down?

And if you copy that line from this post, do you get the error? — agf, Nov 11 '11 at 05:04

score 1 · Answer 1 · answered Nov 11 '11 at 08:26

Your question: How can I track this down?

Answer:

(1) Show us the full text of the error message that you got -- without knowing what encoding you were trying to use, we can't tell you anything. A traceback and a snippet of code that reads the file and reproduces the error would also be handy.

(2) Write a tiny Python script to find the line in the file and then do:

print repr(the_line)    # Python 2.x
print(ascii(the_line))  # Python 3.x

and copy/paste the result into an edit of your question, so that we can see unambiguously what is in the line.

(3) It does look like random gibberish except for the  but do tell us whether you expect that line to be text (if so, in what human language?).
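
For step (2), here is a minimal Python 3 sketch of such a script. It assumes the decode is being attempted as UTF-8, which the question does not confirm; the function name find_undecodable_lines and the two-line demo file are hypothetical and stand in for your real log. Reading in binary mode avoids any implicit decoding, so the repr shows exactly which bytes are on disk.

```python
import os
import tempfile


def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, raw_bytes, error) for each line that fails to decode.

    The default encoding is an assumption; pass whatever your script really uses.
    """
    bad = []
    with open(path, "rb") as f:  # binary mode: no implicit decoding
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                bad.append((lineno, raw, exc))
    return bad


# Hypothetical demo data: one clean line, and one line holding raw
# non-UTF-8 bytes next to nginx-style literal "\xXX" escapes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".log") as tmp:
    tmp.write(b'10.0.0.1 - - "GET / HTTP/1.1" 200\n')
    tmp.write(b'10.0.0.2 - - "gfR\\x15\xa2\\x09\xec" 400\n')
    demo_path = tmp.name

results = find_undecodable_lines(demo_path)
for lineno, raw, exc in results:
    print(lineno, repr(raw))  # unambiguous view of the bytes
    print("   ", exc)         # the byte offset and reason for the failure

os.remove(demo_path)
```

Running the same function over both the original log and the file written from Vim, then comparing the repr output, should show whether the two lines really contain the same bytes.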
