0

I have a KMZ file from the web and want to convert it to CSV or similar using pykml. The file is in UTF-8, or at least it claims to be (see the header below). Reading it works, but an error is triggered at the first accented character.

<?xml version='1.0' encoding='UTF-8'?>
<kml xmlns='http://www.opengis.net/kml/2.2'>
 <Document>
   <name>

from pykml import parser
with open(KMZFIL) as f:
    folder = parser.parse(f).getroot().Document.Folder
for pm in folder.Placemark:
    print(pm.name)

Ablitas (militar) (Emerg)
Ademuz (forestal)
Ager (PL%)
Alcala del Rio (ILIPA MAGNA)(Esc.)
Traceback (most recent call last):
  File "bin4/b21_xxxxxxx", line 15, in <module>
    print(pm.name)

grep "name" $INFIL | head -7
 ( ... )
   <name>Ablitas (militar) (Emerg)</name>
   <name>Ademuz (forestal)</name>
   <name>Ager (PL%)</name>
   <name>Alcala del Rio (ILIPA MAGNA)(Esc.)</name>
   <name>Ainzón</name>
CodeMonkey
Karel Adams
  • Forgot to add what is actually in the input: grep Ainz $INFIL | hexdump -C
        00000000  09 09 09 09 3c 6e 61 6d  65 3e 41 69 6e 7a c3 b3  |....<name>Ainz..|
        00000010  6e 3c 2f 6e 61 6d 65 3e  0a                       |n</name>.|
        00000019
    – Karel Adams Oct 21 '16 at 15:45
  • Yes, that dump shows a UTF-8 character in the correct place. What version of Python are you using? – Mark Ransom Oct 21 '16 at 15:49
  • sorry, I should have added that too. python --version Python 2.7.6 – Karel Adams Oct 21 '16 at 15:50

2 Answers

0

You need to open the file in a way that instructs Python to interpret the bytes as UTF-8 characters. In Python 2.7 you do it with the codecs module.

import codecs
with codecs.open(KMZFIL, encoding='utf-8') as f:
    text = f.read()  # bytes are decoded from UTF-8 to unicode as they are read

In Python 3 the encoding option has been added to the standard open so there's no need to use codecs.
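As a rough illustration of the point above, the sketch below reads a UTF-8 file with an explicit encoding. It uses io.open, which takes the same encoding argument on Python 2.7 and is the built-in open on Python 3. The helper name read_text_utf8 and the scratch file are made up for the example; the sample bytes are the ones from the hexdump in the question's comments.

```python
import io
import os
import tempfile


def read_text_utf8(path):
    """Return the file's contents decoded from UTF-8 (hypothetical helper)."""
    # io.open accepts encoding= on Python 2.7; on Python 3 it is the built-in open.
    with io.open(path, encoding='utf-8') as f:
        return f.read()


# Write the bytes seen in the hexdump (c3 b3 is "ó" in UTF-8) to a scratch file.
fd, sample_path = tempfile.mkstemp(suffix='.kml')
with os.fdopen(fd, 'wb') as out:
    out.write(b'<name>Ainz\xc3\xb3n</name>\n')

decoded = read_text_utf8(sample_path)
os.remove(sample_path)
```

Here decoded is a text (unicode) string in which the two bytes c3 b3 have become the single character u'\xf3'.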

Mark Ransom
  • Still no good, the same error at the same spot:
        Traceback (most recent call last):
          File "bin4/b21_xxxxxxx", line 16, in <module>
            print(pm.name)
        UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 4: ordinal not in range(128)
    Besides, I always understood UTF-8 is the default for all Python file operations; why should I need to refer to it explicitly? – Karel Adams Oct 21 '16 at 16:10
  • @KarelAdams no, UTF-8 is *not* the default - the default is given by [`sys.getdefaultencoding()`](https://docs.python.org/2/library/sys.html#sys.getdefaultencoding). And I missed the part that your problem was in *printing* the result, sorry. That's a totally different answer that I don't have off the top of my head but should be available here on-site somewhere. – Mark Ransom Oct 21 '16 at 16:19
  • Actually it is for me to be sorry, I did not originally indicate the error was in the print statement. Where does the default encoding come from? Is it taken from some LC environment variable at python startup? – Karel Adams Oct 21 '16 at 16:21
  • @KarelAdams you'll have to go searching for that information. I found one: https://drj11.wordpress.com/2007/05/14/python-how-is-sysstdoutencoding-chosen/ – Mark Ransom Oct 21 '16 at 16:30
  • Thanks, Mark, that was enough to find me a solution. Well, that is to say, I consider it a workaround rather, but it does do the job. Also, I understand the issue will only present itself on writing to stdout, which I only do for verification/debugging; the actual code will write to a file, so it shouldn't see this issue anyway.
        import sys
        reload(sys)
        sys.setdefaultencoding('utf-8')
    – Karel Adams Oct 21 '16 at 16:43
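To make the error in the comments above concrete, here is a minimal sketch of what happens when the decoded name is encoded with the ASCII codec versus UTF-8. The variable names are illustrative only. Encoding explicitly, as in the last line, is one possible alternative to changing the interpreter-wide default, which is often regarded as a hack.

```python
name = u'Ainz\xf3n'  # the decoded placemark name; u'\xf3' is "ó"

# Encoding with the ASCII codec reproduces the UnicodeEncodeError from the traceback.
try:
    name.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly as UTF-8 yields the bytes seen in the hexdump (c3 b3).
utf8_bytes = name.encode('utf-8')
```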
0

Didn't see the answer here, but these are lxml StringElements. I used .text to fix this error.

Change print(pm.name) to print(pm.name.text)

https://lxml.de/api/lxml.objectify.StringElement-class.html
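A small sketch of the .text approach, using lxml.objectify directly rather than pykml so that it needs no input file. The two-placemark document is invented for the example and only mirrors the structure shown in the question; it assumes lxml is installed.

```python
from lxml import objectify

# A hypothetical two-placemark document; the second name contains UTF-8 "ó" (c3 b3).
KML_BYTES = (
    b"<?xml version='1.0' encoding='UTF-8'?>"
    b"<kml xmlns='http://www.opengis.net/kml/2.2'>"
    b"<Document><Folder>"
    b"<Placemark><name>Ademuz (forestal)</name></Placemark>"
    b"<Placemark><name>Ainz\xc3\xb3n</name></Placemark>"
    b"</Folder></Document></kml>"
)

folder = objectify.fromstring(KML_BYTES).Document.Folder

# pm.name is an lxml.objectify.StringElement; .text is the plain text it wraps.
names = [pm.name.text for pm in folder.Placemark]
```

Each entry in names is an ordinary text string, so it can be printed or written to a file with whatever encoding is chosen explicitly.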

Weston