
I'm parsing an XML file using the code below:

from lxml import etree

file_name = input('Enter the file name, including .xml extension: ')
print('Parsing ' + file_name)

parser = etree.XMLParser()
tree = etree.parse(file_name, parser)
root = tree.getroot()

# Prefix-to-namespace map used by the XPath queries below
nsmap = {'xmlns': 'urn:tva:metadata:2010'}

with open(file_name + '.log', 'w', encoding='utf-8') as f:
    for info in root.xpath('//xmlns:ProgramInformation', namespaces=nsmap):
        crid = info.get('programId')
        titlex = info.find('.//xmlns:Title', namespaces=nsmap)
        # Compare against None explicitly; lxml elements should not be truth-tested
        title = titlex.text if titlex is not None else 'Missing'
        synopsis1x = info.find('.//xmlns:Synopsis[1]', namespaces=nsmap)
        synopsis1 = synopsis1x.text if synopsis1x is not None else 'Missing'
        # Strip line breaks so each record stays on a single line
        synopsis1 = synopsis1.replace('\r', '').replace('\n', '')
        f.write('{}|{}|{}\n'.format(crid, title, synopsis1))

Let's take an example title of 'Přešité bydlení'. If I print the title whilst parsing the file, it comes out as expected. When I write it out, however, it displays as 'PÅ™eÅ¡ité bydlení'.

I understand that this is to do with encoding (I was able to change the print command to use UTF-8 and 'corrupt' the output), but I couldn't get the written output to appear as I wanted. I had a look at the codecs library, but wasn't successful. Adding 'encoding = "utf-8"' to the XMLParser line didn't make any difference.

How can I configure the written output to be human readable?

SteveC
Nick

2 Answers


I had all sorts of trouble with this before, but the solution is rather simple. The documentation has a chapter on how to read and write Unicode to a file. This Python talk is also very enlightening for understanding the issue. Unicode can be a pain, though it gets a lot easier if you start using Python 3.

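import codecs

# Open for reading and writing; text is encoded as UTF-8 when written
f = codecs.open('test', encoding='utf-8', mode='w+')
f.write(u'\u4500 blah blah blah\n')
f.seek(0)  # rewind so the line just written can be read back
print(repr(f.readline()[:1]))
f.close()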
Jonathan
  • Just to make it more obvious to other people, the \u4500 within the quotes is needed to make it work. I didn't need / use the f.seek(0) in my application. – Nick Apr 03 '14 at 16:24
  • I spoke a bit too soon. I'm seeing a lot of '?' in the text, where it can't display a character. For example, it is outputting 'nikdo a nic nám neute?e' instead of 'nikdo a nic nám neuteče'. Any ideas? – Nick Apr 03 '14 at 16:33
  • `\u4500` is a Han Character and has nothing to do with the OP's question. This code block is copied from the Python manual and is irrelevant – Alastair McCormack Apr 03 '14 at 20:42
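One plausible way a '?' like the one reported above can arise is a lossy conversion to a legacy code page somewhere between the file and the screen. The sketch below assumes Windows-1252 (cp1252) as that code page, which is a guess and not something stated in the thread; 'č' has no slot in cp1252, so a replacing encoder substitutes '?'.

```python
# Assumption: the viewer or an intermediate step converts text to cp1252.
text = 'nikdo a nic nám neuteče'

# errors='replace' swaps every character cp1252 cannot represent for '?'
lossy = text.encode('cp1252', errors='replace').decode('cp1252')

print(lossy)  # nikdo a nic nám neute?e
```

Note that 'á' survives because cp1252 does contain it, which matches the pattern in the reported output.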

Your code looks OK, so I reckon your input is duff. Assuming you're viewing your output file with a UTF-8 viewer or shell, I suspect that the encoding declared in the <?xml ... ?> declaration doesn't match the file's actual encoding.

This would explain why printing works but not writing to a file. If your shell/IDE is set to "ISO-8859-2" and your input XML is also "ISO-8859-2" then printing is pushing out the raw encoding.
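As a rough illustration of this kind of mismatch, the sketch below writes characters as UTF-8 and then reads the bytes back as Windows-1252. The choice of cp1252 as the wrong encoding is an assumption made for the demo (it is not confirmed anywhere in the question), and misread is a hypothetical helper name.

```python
def misread(text, written_as='utf-8', read_as='cp1252'):
    """Encode text one way, then decode the bytes with a different codec."""
    return text.encode(written_as).decode(read_as, errors='replace')

# Each Czech letter becomes two unrelated characters when misread
print(misread('ř'))  # Å™
print(misread('š'))  # Å¡

# The damage is reversible as long as no bytes were dropped along the way
print('Å™'.encode('cp1252').decode('utf-8'))  # ř
```

If the garbled output has this two-characters-per-letter shape, the bytes in the file are probably fine and only the viewer's decoding is wrong.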

Alastair McCormack
  • It appears to be a problem with 'Textpad', which doesn't properly support UTF-8 (only used because it's great for opening really large log files). It opens correctly in Notepad++ and all is good in the world. – Nick Apr 04 '14 at 11:54
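For anyone stuck with an editor that does not auto-detect UTF-8, one thing that may be worth trying is writing the log with a byte order mark via Python's 'utf-8-sig' codec. Whether TextPad honours the BOM is an assumption here, not something verified in this thread. The write_log helper, the file name, and the sample record are hypothetical.

```python
import os
import tempfile

def write_log(path, lines, encoding='utf-8-sig'):
    """Write lines as UTF-8 with a leading BOM (hypothetical helper)."""
    # newline='\n' keeps line endings identical on every platform
    with open(path, 'w', encoding=encoding, newline='\n') as f:
        for line in lines:
            f.write(line + '\n')

# Hypothetical record in the crid|title|synopsis layout from the question
path = os.path.join(tempfile.mkdtemp(), 'example.xml.log')
write_log(path, ['crid://example/1|Přešité bydlení|Missing'])

with open(path, 'rb') as f:
    raw = f.read()

print(raw[:3])  # b'\xef\xbb\xbf'
```

The trade-off is that tools expecting plain UTF-8 may show the BOM as a stray character at the start of the first line, so this is only worth doing if the consuming editor needs it.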