I am learning Python (version 2.7) and I have a task: validate an XML document against an XSD schema using the lxml library (http://lxml.de/). I have two example files, like these:

$ cat 1.xml 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE yml_catalog SYSTEM "shops.dtd">
<a>
  <b>Привет мир!</b>
</a>

and

$ cat 2.xsd
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="a" type="AType"/>
  <xs:complexType name="AType">
    <xs:sequence>
      <xs:element name="b" type="xs:decimal" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>

It should be very simple, but I don't understand how to use lxml with UTF-8 (I have never worked much with encodings). I take these simple steps:

>>> from lxml import etree
>>> schema = etree.parse("/tmp/qwerty/2.xsd")
>>> xmlschema = etree.XMLSchema(schema)
>>> try:
    document = etree.parse("/tmp/qwerty/1.xml")
    print "Parse complete!"
except etree.XMLSyntaxError, e:
    print e

Parse complete!
>>> xmlschema.validate(document)
False
>>> xmlschema.error_log

Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    xmlschema.error_log
  File "xmlerror.pxi", line 286, in lxml.etree._ListErrorLog.__repr__ (src/lxml/lxml.etree.c:33216)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 85-90: ordinal not in range(128)

So I cannot read the errors collected in .error_log.

Is there a workaround using the encode/decode methods so that I can read the log at all, or a proper solution that does not need another library (I mean standard Python methods)? Or do I need to use StringIO, and if so, how?

I understand that my problem depends on "Привет мир!" and xs:decimal; these are only short examples. Sorry for my English. Thank you.

WGS
dmgl
  • With another language and the string "Hello world!" instead of "Привет мир!" there is no problem: >>> xmlschema.error_log /tmp/qwerty/1.xml:4:0:ERROR:SCHEMASV:SCHEMAV_CVC_DATATYPE_VALID_1_2_1: Element 'b': 'Hello world!' is not a valid value of the atomic type 'xs:decimal'. – dmgl Apr 10 '14 at 22:18

1 Answer

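Your traceback comes from the repr of the error log: in Python 2 it tries to convert the Unicode messages with the ASCII codec, which fails on the Cyrillic text. Instead of displaying the whole log, iterate over its entries and encode each message as UTF-8 yourself. Try the following: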

Code:

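from lxml import etree

# Python 2.7. Load the XSD and compile it into a schema object.
schema = etree.parse("2.xsd")
xmlschema = etree.XMLSchema(schema)

try:
    document = etree.parse("1.xml")
except etree.XMLSyntaxError as e:
    # The file is not well-formed XML, so there is nothing to validate.
    print e
else:
    print "Parse complete!"
    print xmlschema.validate(document)
    # Encode each message explicitly instead of printing the whole log,
    # whose repr falls back to the ASCII codec.
    for error in xmlschema.error_log:
        print "ERROR ON LINE %s: %s" % (error.line, error.message.encode("utf-8"))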

Result:

Parse complete!
False
ERROR ON LINE 4: Element 'b': 'Привет мир!' is not a valid value of the atomic type 'xs:decimal'.
[Finished in 1.3s]

Relevant documentation can be found here.
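
To see why the explicit encode matters, here is a minimal sketch that needs only the standard library. LogEntry is a hypothetical stand-in for one error-log entry (it is not part of lxml) and models only the line and message attributes used above; the message text is copied from the result shown. The sketch compares a conversion to ASCII, which is what the failing repr amounts to, with an explicit conversion to UTF-8.

```python
# -*- coding: utf-8 -*-
from collections import namedtuple

# Hypothetical stand-in for one error-log entry (NOT an lxml class);
# only the two attributes used in the answer are modelled.
LogEntry = namedtuple("LogEntry", ["line", "message"])

entry = LogEntry(
    line=4,
    message=u"Element 'b': 'Привет мир!' is not a valid value "
            u"of the atomic type 'xs:decimal'.",
)

# Converting to ASCII fails, because ASCII cannot represent Cyrillic.
try:
    entry.message.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Converting explicitly to UTF-8 works for this text.
formatted = u"ERROR ON LINE %s: %s" % (entry.line, entry.message)
utf8_bytes = formatted.encode("utf-8")

# Decoding the bytes again gives back the original text unchanged.
assert utf8_bytes.decode("utf-8") == formatted
```

The same idea applies to the real log: format each entry yourself and pick the encoding, rather than relying on whatever codec the interpreter falls back to.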

Let us know if this helps.

WGS
  • It's not stupid, tbh. Still, the first thing to check when you have a problem is the documentation; that is what I did right after reading your question. :) You are welcome and good luck! – WGS Apr 11 '14 at 13:51