can't encode character '\u0144' even using encoding=utf-8 in python3

Question

I am trying to read some information from some .txt files. They are all in English and do not contain any other Unicode characters. The problem is that for one specific file the script just crashes and does not show the information. The error is:

Traceback (most recent call last):
  File "C:\users\bienvenido\desktop\programmacion\harvard\cs50 artificial inteligence\6\questions\questions.py", line 107, in <module>
    main()
  File "C:\users\bienvenido\desktop\programmacion\harvard\cs50 artificial inteligence\6\questions\questions.py", line 16, in main
    files = load_files(sys.argv[1])
  File "C:\users\bienvenido\desktop\programmacion\harvard\cs50 artificial inteligence\6\questions\questions.py", line 59, in load_files
    files[file] = f.read()
  File "C:\Users\BIENVENIDO\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 38619: character maps to <undefined>

and what I am doing is:

with open(os.path.join(directory,file), encoding='utf-8') as f:
    files[file] = f.read()
    print(files[file])

I also tried utf16 and the default encoding.

First, you could `print(file)` to see which file causes the problem, then open it in a normal editor to see what it contains. When I run `print('\u0144')` I get `ń`, which is not an English character; it is a Polish one. — furas, Jun 11 '21 at 05:12
BTW: you should check whether the problem is with `read()` or with `print()`. Sometimes on Windows `print()` tries to encode text with `latin1` or `cp1250` and then has trouble displaying `utf-8` text. — furas, Jun 11 '21 at 05:14
Always put the full error message (starting at the word "Traceback") in the question (not in a comment) as text (not a screenshot, not a link to an external portal). It contains other useful information. — furas, Jun 11 '21 at 05:14
Make a [mre]. Provide minimal file content (and its encoding) and the minimal code to produce the real error you see. What you've shown is not reproducible. Obviously `'\u0144'.encode('utf8')` works, so the question title is not representative of the problem as well. Since it is an encode (not decode) error, the likely cause is the `print` to a terminal that isn't configured for it, but the full traceback wasn't provided. — Mark Tolonen, Jun 11 '21 at 06:03
It is good practice to include (in the question) the entire error stack, or at minimum the part starting from your code. The last item in the stack (in this case) is inside a library, so it doesn't tell us much about where the error happens in your code. — Giacomo Catenazzi, Jun 11 '21 at 07:17
@furas the file does not have that character; it only has English characters — jhonny, Jun 11 '21 at 15:53
As I suggested before (and as @GiacomoCatenazzi suggests in his answer), it can be a problem with `print()`. And as I said before: put the FULL error message in the question. At this moment we can only ignore (or close) your question because there is no solution without more information. — furas, Jun 11 '21 at 15:58
@MarkTolonen, I am using a large file, and every time I run the script the problem seems to appear in a different position. I don't think it would be a good idea to post the whole file here — jhonny, Jun 11 '21 at 16:03
@furas I have now edited the question and it contains the complete error message — jhonny, Jun 11 '21 at 16:03
The error points to `load_files()`, which you didn't show in the question. It also shows `cp1252.py`, which means Python tries to read the file as `cp1252`, not `utf-8`. All this can mean that you don't actually use `encoding='utf-8'` — furas, Jun 11 '21 at 17:23
Did you *try* to find a single portion of a single file that reproduces the issue? If we can't reproduce it how do you expect it to be fixed? — Mark Tolonen, Jun 11 '21 at 17:24
The error also shows a problem with the byte at position 38619, so you could check what the file has at position 38619. It also shows a problem with byte `0x9d`, and only `cp1250` gives me a valid character for it, `'ť'` (`print( b'\x9d'.decode('cp1250') )`), which can mean the file is encoded as `cp1250`, not `utf-8` — furas, Jun 11 '21 at 17:28
I did not include the portion of the file where the error occurs because it is just blank space, ` Model assessments `; it is in the space before the s, @MarkTolonen, @furas — jhonny, Jun 12 '21 at 23:04
You could read the whole file in binary mode `'rb'`, then look at the position of the failure `data[38610:38630]`. Since you can’t make a reproducible example I am voting to close — Mark Tolonen, Jun 13 '21 at 04:52
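
The binary-mode check suggested in the last comments could look roughly like the sketch below. The helper name `show_decode_failure` and its `context` parameter are hypothetical, introduced here only for illustration; the idea is simply to read the raw bytes, attempt the decode, and return the bytes around the position Python complains about.

```python
def show_decode_failure(path, encoding="utf-8", context=10):
    """Return None if the file decodes cleanly with `encoding`,
    otherwise (failing position, raw bytes around that position)."""
    # Read raw bytes so that no codec is applied while reading.
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        start = max(exc.start - context, 0)
        return exc.start, data[start:exc.start + context]
    return None
```

Calling it once per candidate encoding (for example `"utf-8"`, `"cp1252"`, `"cp1250"`) shows which codecs accept the file and which raw bytes sit at the failing offset.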

score 0 · Answer 1 · answered Jun 11 '21 at 07:15

You are not using UTF-8 (not in the right place).

The problem is in the encode step, that is, on the writing side (string to binary data/encoded string). Otherwise you would get a "could not decode" error.

So the problem is not the "open", but the print. Not all consoles support UTF-8, and Python (by default) uses the console's encoding for standard output (which is very sensible).

So, to check, instead of printing, just write to a temporary file and check whether that works (and whether you have UTF-8 data). I assume this is the case (but check!).
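
A minimal sketch of that check, assuming the text has already been read into a string (the helper name `dump_text` is hypothetical): the file is opened with an explicit `encoding="utf-8"`, so the console's encoding is not involved at all.

```python
def dump_text(text, path):
    """Write `text` to `path` as UTF-8 instead of printing it."""
    # An explicit encoding means the platform default (e.g. cp1252) is not used.
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)
    return path
```

If this succeeds where `print()` fails, the data itself is fine and the console encoding is the likely culprit.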

In that case, you should check why your console is not UTF-8. Microsoft Windows is known to be the last large operating system where UTF-8 is not the default. You can look on this site for how to enable UTF-8 on various terminals/consoles/PowerShell/tools. But you can get similar errors on other operating systems too, when the running user has a non-UTF-8 locale (e.g. set with the LANG environment variable). The most common case is C (a standard locale which is older than UTF-8 and supports only ASCII, because it must be very standard). This locale is mostly used by root, but modern operating systems may use a UTF-8 version of C.
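
To see which encoding Python picked for standard output, and whether a given string would survive `print()`, something like the following sketch may help. The helper name `can_print` is hypothetical; `sys.stdout.encoding` is the standard attribute it relies on.

```python
import sys

def can_print(text, stream=None):
    """Return True if `text` can be encoded with the stream's encoding."""
    if stream is None:
        stream = sys.stdout
    # Fall back to ASCII if the stream does not report an encoding (assumption).
    encoding = getattr(stream, "encoding", None) or "ascii"
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True
```

If `can_print('\u0144')` returns False, setting the `PYTHONIOENCODING=utf-8` environment variable, or calling `sys.stdout.reconfigure(encoding="utf-8")` on Python 3.7+, is one way to change the output encoding, provided the terminal can actually display the result.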
