I'm adapting the following code (written following advice in this question), which takes an XML file and its DTD and converts them to a different format. For this problem only the loading section matters:

xmldoc = open(filename)

parser = etree.XMLParser(dtd_validation=True, load_dtd=True)    
tree = etree.parse(xmldoc, parser)

This worked fine when reading from the file system, but I'm converting it to run in a web framework, where the two files are uploaded via a form.

Loading the xml file works fine:

tree = etree.parse(StringIO(data['xml_file']))

But because the DTD is referenced at the top of the XML file, the following statement fails:

parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
tree = etree.parse(StringIO(data['xml_file']), parser)

Via this question, I tried:

etree.DTD(StringIO(data['dtd_file']))
tree = etree.parse(StringIO(data['xml_file']))

While the first line doesn't raise an error, the second fails on the entities the DTD is meant to define (and does define in the file-system version):

XMLSyntaxError: Entity 'eacute' not defined, line 4495, column 46

How do I go about correctly loading this DTD?

Jon Hadley

2 Answers

Here's a short but complete example, using the custom resolver technique @Steven mentioned.

from StringIO import StringIO
from lxml import etree

data = dict(
    xml_file = '''<?xml version="1.0"?>
<!DOCTYPE x SYSTEM "a.dtd">
<x><y>&eacute;zz</y></x>
''',
    dtd_file = '''<!ENTITY eacute "&#233;">
<!ELEMENT x (y)>
<!ELEMENT y (#PCDATA)>
''')

class DTDResolver(etree.Resolver):
    def resolve(self, url, id, context):
        # Serve the in-memory DTD instead of looking up "a.dtd" on disk
        return self.resolve_string(data['dtd_file'], context)

xmldoc = StringIO(data['xml_file'])
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
parser.resolvers.add(DTDResolver())
try:
    tree = etree.parse(xmldoc, parser)
except etree.XMLSyntaxError as e:
    # handle XML and validation errors
    print("XML failed to parse or validate: %s" % e)
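
If the XML and DTD arrive as file-like objects (form uploads, for instance) rather than strings, the same resolver idea can be adapted by reading the DTD contents once and handing them to the resolver. The following is only a sketch under some assumptions: it targets Python 3, uses `BytesIO` objects to stand in for the uploads, and the names `InMemoryDTDResolver` and `parse_with_uploaded_dtd` are hypothetical helpers, not part of lxml.

```python
from io import BytesIO
from lxml import etree


class InMemoryDTDResolver(etree.Resolver):
    """Hypothetical helper: answers every external lookup with one DTD."""

    def __init__(self, dtd_text):
        self.dtd_text = dtd_text

    def resolve(self, system_url, public_id, context):
        # Ignore the requested URL and return the uploaded DTD contents.
        return self.resolve_string(self.dtd_text, context)


def parse_with_uploaded_dtd(xml_file, dtd_file):
    """Parse and validate xml_file against the DTD read from dtd_file."""
    parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
    # Read the DTD once up front; the file object may not be re-readable.
    parser.resolvers.add(InMemoryDTDResolver(dtd_file.read()))
    return etree.parse(xml_file, parser)


# Stand-ins for the two uploaded files (assumed to behave like binary files).
xml_upload = BytesIO(b'''<?xml version="1.0"?>
<!DOCTYPE x SYSTEM "a.dtd">
<x><y>&eacute;zz</y></x>
''')
dtd_upload = BytesIO(b'''<!ENTITY eacute "&#233;">
<!ELEMENT x (y)>
<!ELEMENT y (#PCDATA)>
''')

tree = parse_with_uploaded_dtd(xml_upload, dtd_upload)
text = tree.getroot().find('y').text
```

Reading the DTD into memory before parsing means the resolver can be called more than once without worrying about the position of the upload's file pointer.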
snapshoe
  • Your example works in isolation, but I'm using StringIO objects for both the xml_file and dtd_file - I can't get it working with those.... – Jon Hadley Nov 16 '10 at 22:49
  • I've tried setting data = dict(xml_file= xml_file.read(),dtd_file= dtd_file.read()) ... which (seems) to get me further in, but now I have - UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 28: ordinal not in range(128) - I thought the DTD would pick up on this..... – Jon Hadley Nov 16 '10 at 23:11
  • Correction - UnicodeEncodeError was unrelated to your code. dtd_file.read() etc seems to have done the job, although I'm unsure if it's the best approach. – Jon Hadley Nov 16 '10 at 23:51
You could probably use a custom resolver. The docs actually give an example of doing this to provide a DTD.
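
As a rough illustration of that idea, a resolver can also be keyed on the system URL named in the DOCTYPE, returning `None` for anything it does not recognise so that lxml falls back to its other resolvers. This is a sketch assuming Python 3; `MappingResolver` and the `documents` dictionary are hypothetical names, and the `"a.dtd"` key is assumed to match the DOCTYPE of the document being parsed.

```python
from io import BytesIO
from lxml import etree


class MappingResolver(etree.Resolver):
    """Hypothetical helper: maps system URLs to in-memory DTD contents."""

    def __init__(self, documents):
        self.documents = documents

    def resolve(self, system_url, public_id, context):
        content = self.documents.get(system_url)
        if content is None:
            # Unknown URL: decline, so other resolvers get a chance.
            return None
        return self.resolve_string(content, context)


# Assumed mapping: the key matches the SYSTEM identifier in the DOCTYPE.
documents = {
    'a.dtd': b'''<!ENTITY eacute "&#233;">
<!ELEMENT x (y)>
<!ELEMENT y (#PCDATA)>
''',
}

parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
parser.resolvers.add(MappingResolver(documents))

xml_source = BytesIO(b'''<?xml version="1.0"?>
<!DOCTYPE x SYSTEM "a.dtd">
<x><y>&eacute;zz</y></x>
''')
tree = etree.parse(xml_source, parser)
text = tree.getroot().find('y').text
```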

Steven