
I'm trying to read a file that Sublime Text 3 reports as cp1252, and I'm getting a UnicodeEncodeError.

import codecs

with codecs.open(config_path, mode='rb', encoding='cp1252') as f:
    lines = f.readlines()

UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 15: character maps to <undefined>

I can read the file if I change the encoding to latin-1, which is a bit weird. I'm fairly new to encoding/decoding, and if I open the file in Notepad++, ST3, or Excel, it is just an incomprehensible list of what looks like binary data to me.

with codecs.open(config_path, mode='r', encoding='latin-1') as f:
    lines = f.readlines()

    for line in lines:
        utf_line = line.encode("utf-8")

print(utf_line)
b"\x00\x03'\xc2\x9a\x00\x03'\xc2\x9a\x00\x03&\xc3\xba\x00\x03'\xc3\x9a\x00\x03'?\x00\x03'\xc2\xbd\x00\x03't\x00\x03'\xc2\xb2\x00\x03'\xc3\xac\x00\x03'\xc3\x9b\x00\x03'1\x00\x03'\xc2\x98\x00\x03'M\x00\x03'o\x00\x03'\xc3\x8b\x00\x03'\xc2\xbf\x00\x03'd\x00\x03'\xc2\xbf\x00\x03'\xc3\xb0\x00\x03'1\x00\x03'\xc2\x9f\x00\x03'\xc2\x9f\x00\x03'V\x00\x03'\xc2\xa0\x00\x03'G\x00\x03'\x15\x00\x03'u\x00\x03'\xc2\xae\x00\x03'`\x00\x03'|\x00\x03'\x17\x00\x03'Q\x00\x03'8\x00\x03'\xc2\x94\x00\x03':\x00\x03'4\x00\x03'P\x00\x03'\xc2\x9d\x00\x03'\xc2\x9f\x00\x03''\x00\x03'\xc3\x92\x00\x03't\x00\x03'\xc3\xb3\x00\x03'l\x00\x03'c\x00\x03'2\x00\x03'i\x00\x03'C\x00\x03'=\x00\x03'\x0f\x00\x03'\xc3\x89\x00\x03'\xc3\x8a\x00\x03'\xc2\xb7\x00\x03'`\x00\x03'T\x00\x03'\xc2\x90\x00\x03'\xc3\x9b\x00\x03'\xc2\x90\x00\x03'y\x00\x03'?\x00\x03'\xc2\x92\x00\x03'\xc3\xad\x00\x03'g\x00\x03'\xc2\x84\x00\x03'@\x00\x03'\xc2\xa9\x00\x03'q\x00\x03'L\x00\x03'\xc2\xae\x00\x03'

Here is the file

As suggested, I've tried using chardet as follows:

with open(config_path, mode='rb') as f:
    lines = f.read()
    encoding = chardet.detect(lines)
    print(encoding)
{'encoding': None, 'confidence': 0.0, 'language': None}

If I test each line separately, I get a mix of encodings: cp1252, cp1253, ascii...
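A detection result of `None` with confidence 0.0, together with a different guess for every line, is what one would expect if the bytes are not text at all. A minimal sketch of a quick check, assuming the common heuristic that NUL bytes or many control characters indicate binary data; the function name `looks_binary`, the sample size, and the 30% threshold are all illustrative assumptions, not part of chardet:

```python
def looks_binary(data, sample_size=8192, threshold=0.30):
    """Heuristic guess at whether raw bytes are binary rather than text.

    Assumption: text files rarely contain NUL bytes, and only a small
    share of control characters other than tab, LF, and CR.
    """
    sample = data[:sample_size]
    if not sample:
        return False
    # A single NUL byte is a strong hint for 8-bit encodings such as cp1252.
    if b"\x00" in sample:
        return True
    text_controls = {9, 10, 13}  # tab, line feed, carriage return
    control_count = sum(1 for b in sample if b < 32 and b not in text_controls)
    return control_count / len(sample) > threshold


if __name__ == "__main__":
    # The first bytes shown in the output above, as raw bytes.
    print(looks_binary(b"\x00\x03'\x9a\x00\x03'\x9a"))
    print(looks_binary(b"name = value\r\n"))
```

This only says whether decoding as text is plausible; it cannot identify the actual binary layout.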

Thank you

beni
  • It would be best if you could post the file. MrFuppes is correct that you don't want to be using 'b' with an explicit encoding. I'll also note that it's unnecessary to use codecs.open() here -- but let's see the file. – kerasbaz Aug 18 '20 at 09:59
  • I don't think it's a good idea to use whatever encoding Sublime Text guesses, unless the file explicitly states that it's encoded in cp1252. – bastantoine Aug 18 '20 at 10:05
  • you could have a look at the [chardet](https://chardet.readthedocs.io/en/latest/usage.html#basic-usage) package if you're unsure of the encoding. but that won't help much if it actually *is* a binary file (i.e. the bytes don't represent characters but arbitrary data, such as floats, integers etc.) ;-) – FObersteiner Aug 18 '20 at 10:05
  • @MrFuppes The modes work differently for [codecs.open](https://docs.python.org/3/library/codecs.html#codecs.open). Files are *always* opened in binary mode, so it makes no difference whether `b` is specified or not. (The builtin `open` raises a `ValueError` if an encoding argument is given when using binary mode - not a `UnicodeEncodeError`). – ekhumoro Aug 18 '20 at 10:33
  • The problem here seems to be that the file is not, in fact, encoded as cp1252, since some of the bytes map to undefined characters. – ekhumoro Aug 18 '20 at 10:39
  • @MrFuppes No, you are wrong. The arguments for builtin `open` are different to `codecs.open`, so it makes no sense to quote from the docs for it. The codecs module ***always*** returns text-mode file objects, because the whole point of it is to provide transparent encoding and decoding. – ekhumoro Aug 18 '20 at 10:49
  • Looking at the file in xxd, the first couple of lines include the words "ADVANTEST" and "COTDR". Looks like it's the output of some kind of device. It's mostly binary. It is not a text file. – snakecharmerb Aug 18 '20 at 15:34
  • Indeed, the file comes from an instrument. I'm trying to "refurbish" an old home-made program written in LabVIEW. I quickly looked for a way to read binary files, with no success so far... – beni Aug 19 '20 at 07:37
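The points raised in the comments can be sketched in a few lines of standard-library Python. Python's cp1252 codec leaves five byte values undefined, while latin-1 maps all 256, which would explain why only latin-1 reads the file without errors. The helper names below (`undecodable_bytes`, `hexdump`, `unpack_uint32_be`) are hypothetical, and reading the data as big-endian unsigned 32-bit integers is only a guess based on the repeating `\x00\x03'` pattern in the posted output, not a documented format for this instrument:

```python
import struct


def undecodable_bytes(encoding):
    """Return the byte values that the given 8-bit codec cannot decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


def hexdump(data, width=16):
    """Format bytes as offset, hex columns, and printable ASCII, like xxd."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3}} {text_part}")
    return "\n".join(lines)


def unpack_uint32_be(data):
    """ASSUMPTION: interpret data as consecutive big-endian uint32 values."""
    count = len(data) // 4
    return list(struct.unpack(f">{count}I", data[:count * 4]))


if __name__ == "__main__":
    # First 12 raw bytes recovered from the latin-1 output in the question.
    sample = b"\x00\x03\x27\x9a\x00\x03\x27\x9a\x00\x03\x26\xfa"
    print([hex(b) for b in undecodable_bytes("cp1252")])
    print(undecodable_bytes("latin-1"))
    print(hexdump(sample))
    print(unpack_uint32_be(sample))
```

For the real file, the bytes would come from `open(config_path, mode='rb').read()` with no encoding argument. Any header (such as the part containing "ADVANTEST") would have to be skipped before unpacking, and its length is not known from the question.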

0 Answers