
I have a .txt file containing Polish city names that I want to read with Python. I use this code (my script has `# -*- coding: utf-8 -*-` on the first line):

string='PL.txt'
country=io.open(string,mode=r, encoding='utf-8')
lezer=csv.reader(country,dialect='excel-tab')
my_dict=defaultdict(list)
for record in lezer:
   pc, gemeente= record[0], record[1]
   my_dict[pc].append(gemeente)
 return my_dict

When I run the code, it starts and then this error appears: return codecs.charmap_encode(input,errors,encoding_table) UnicodeEncodeError: 'charmap' codec can't encode character u'\u0144' in position 35: character maps to <undefined>

I've searched the internet and found various answers, but not exactly the one I need. As far as I understand, the problem is the character ń: the charmap codec being used doesn't contain this character, so it can't be encoded. I tried another codec, utf16, but then the text maps to something strange. I also tried other codecs such as latin-1, cp437 and cp1252.
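To make the "character maps to <undefined>" message concrete, here is a minimal sketch that checks which codecs can represent ń (U+0144) at all. It assumes Python 3 (the question appears to use Python 2), and the helper name `encodable` is made up for illustration.

```python
# Python 3 sketch: which codecs can represent U+0144 (LATIN SMALL LETTER N WITH ACUTE)?
CANDIDATES = ["utf-8", "cp1250", "iso-8859-2", "cp1252", "latin-1", "cp437"]


def encodable(text, codec):
    """Return True if `text` can be encoded with `codec` without an error."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical helper applied to the single character from the traceback.
results = {codec: encodable("\u0144", codec) for codec in CANDIDATES}
for codec, ok in results.items():
    print(codec, "can encode it" if ok else "cannot encode it")
```

The codecs that fail here are the Western European ones. The Central European code pages and UTF-8 succeed, which is consistent with the error being raised while encoding (for example when printing to a console) rather than while reading the file.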

I also tried:

string='PL.txt'
country=io.open(string,mode=r, encoding='utf-8')
lezer=csv.reader(country,dialect='excel-tab')
my_dict=defaultdict(list)
for record in lezer:
   pc, gemeente= record[0], record[1].encode('utf16')
   my_dict[pc].append(gemeente)
 return my_dict

When I check with type(record[1]), it gives str and not unicode. The same happens with other Polish characters.
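For comparison, this is a rough sketch of how the reading step could look in Python 3, where `csv.reader` works on text directly. It assumes the file really is tab-separated UTF-8 with the postal code in the first column and the place name in the second, as in the code above. The function name `read_places` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_places(path, encoding="utf-8"):
    """Map each postal code (column 0) to the list of place names (column 1)."""
    places = defaultdict(list)
    # newline="" is what the csv documentation recommends for files given to csv.reader.
    with io.open(path, mode="r", encoding=encoding, newline="") as handle:
        for record in csv.reader(handle, dialect="excel-tab"):
            if len(record) < 2:
                continue  # skip blank or incomplete lines
            postcode, place = record[0], record[1]
            places[postcode].append(place)
    return places
```

Calling `read_places('PL.txt')` returns the dictionary without any explicit `.encode()` call, since the values stay as text. Note that the Python 2 documentation for `csv` states that the module does not support Unicode input, so this sketch does not carry over to Python 2 unchanged.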

  • The Polish language uses characters that are present in neither ISO Latin-1 nor Microsoft CP-1252; try ISO Latin-2 or Microsoft CP-1250 instead. – Błotosmętek Jun 23 '17 at 08:31
  • There is no error with the character `ń` in a UTF-8-encoded text file. There is only one explanation - your `PL.txt` file is not UTF-8. – Tomalak Jun 23 '17 at 09:10
  • Dump the file contents in binary mode (`print(io.open(r'bla.txt', 'rb').read())`) and tell us which bytes it uses for `ń`. The correct byte sequence would be `\xc5\x84` - [LATIN SMALL LETTER N WITH ACUTE](http://www.fileformat.info/info/unicode/char/0144/index.htm). – Tomalak Jun 23 '17 at 09:15
  • Check out unicodedata.normalize(form, unistr): https://docs.python.org/2/library/unicodedata.html – Rolf of Saxony Jun 23 '17 at 09:18
  • `#coding:utf8` declares *source code* encoding. The source code displayed has no non-ASCII characters in it, so it has no effect. Please post code that reproduces the error. The code as posted has a syntax error: `mode=r` should be `mode='r'`. A `UnicodeEncodeError` normally occurs during writing or printing which the code as shown doesn't do. Please post a [minimal, complete, verifiable example](http://stackoverflow.com/help/mcve) that reproduces the problem and give the full error traceback. – Mark Tolonen Jun 24 '17 at 05:51
  • @Tomalak the text file must be UTF-8, because the error was on encoding, not decoding. In fact, the example error has Unicode character `\u0144` which is `ń`, so it was decoded correctly. Most likely, we aren't seeing the code producing the error. – Mark Tolonen Jun 24 '17 at 05:57
  • @Mark I was scratching my head about that as well. The sample and the description are at odds. – Tomalak Jun 24 '17 at 06:19
  • @Tomalak @Mark This was still the code producing the error. When I looked more closely at my input file, I saw that it contained characters such as Ã³ that should have been ó (Google) and Å› that should have been ś. I think that encode and decode can't help. – Helma Schapendonk Jun 27 '17 at 18:21
  • So I changed my code and wrote: pc, gemeente= record[0], record[1].encode('ascii','ignore'). But I still get the same error. Now it seems that the ignore doesn't work. When I use it outside the loop, the code works fine: 'Belsk Duży'.encode('ascii','ignore') gives: 'Belsk Duy' – Helma Schapendonk Jun 27 '17 at 18:45
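The symptom described in the last comments (Å› where ś was expected) is what UTF-8 bytes look like after being decoded with a Western code page. The sketch below simulates that damage and reverses it. It assumes Python 3 and assumes the wrong codec was cp1252, which the thread does not confirm. The helper name `undo_mojibake` is hypothetical.

```python
def undo_mojibake(text, wrong_codec="cp1252"):
    """Reverse one round of 'UTF-8 bytes decoded with the wrong codec'.

    wrong_codec is an assumption; the thread does not say which one was used.
    """
    return text.encode(wrong_codec).decode("utf-8")


# The byte check suggested in the comments: ń should be stored as \xc5\x84 in UTF-8.
assert "\u0144".encode("utf-8") == b"\xc5\x84"

# Simulate the damage on ś (U+015B), then repair it.
damaged = "\u015b".encode("utf-8").decode("cp1252")
repaired = undo_mojibake(damaged)

# ascii() keeps the output printable on any console encoding.
print(ascii(damaged), "->", ascii(repaired))
```

This only works when every damaged byte has a mapping in the assumed codec. A few byte values are undefined in Python's cp1252, so some letters may fail to round-trip, and text already passed through `.encode('ascii','ignore')` cannot be restored this way because the characters have been dropped.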
