
I have an XML file that I'm reading using Python's lxml.objectify library.

I can't find a way to get the contents of an XML comment:

<data>
  <!--Contents 1-->
  <some_empty_tag/>
  <!--Contents 2-->
</data>

I'm able to retrieve the comment nodes (is there a better way? xml.comment[1] does not seem to work):

from lxml import etree, objectify

xml = objectify.parse(the_xml_file).getroot()
for c in xml.iterchildren(tag=etree.Comment):
    print(c.????)  # how do I print the contents of the comment?
    # print(c.text)  # does not work
    # print(str(c))  # also does not work

What is the correct way?

RedX
  • I wouldn't expect to be able to parse comments with an XML library; by definition they aren't part of the XML structure and can always be ignored by any tool. – Daenyth Sep 08 '16 at 12:17

1 Answer


You just need to serialize each comment node back to a string to extract its contents, like this:

In [1]: from lxml import etree, objectify

In [2]: tree = objectify.fromstring("""<data>
   ...:   <!--Contents 1-->
   ...:   <some_empty_tag/>
   ...:   <!--Contents 2-->
   ...: </data>""")

In [3]: for node in tree.iterchildren(etree.Comment):
   ...:     print(etree.tostring(node))
   ...:
b'<!--Contents 1-->'
b'<!--Contents 2-->'

Of course, you may want to strip the surrounding `<!--` and `-->` markers from each result.
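
As a rough sketch of that stripping step, something like the following might do. The helper name `strip_comment_markers` is made up for illustration (it is not part of lxml), and it assumes the input looks exactly like the serialized comments printed in the session above, with nothing else around them:

```python
# Hypothetical helper, not part of lxml: removes the "<!--" / "-->" markers
# from a comment that was serialized with etree.tostring().
COMMENT_OPEN = "<!--"
COMMENT_CLOSE = "-->"


def strip_comment_markers(serialized, encoding="utf-8"):
    """Return the text between '<!--' and '-->' in a serialized comment."""
    # etree.tostring() returns bytes by default, so decode them first.
    if isinstance(serialized, bytes):
        text = serialized.decode(encoding)
    else:
        text = serialized
    text = text.strip()

    # Refuse anything that is not a complete "<!--...-->" string.
    too_short = len(text) < len(COMMENT_OPEN) + len(COMMENT_CLOSE)
    if too_short or not (text.startswith(COMMENT_OPEN) and text.endswith(COMMENT_CLOSE)):
        raise ValueError("not a serialized XML comment: %r" % (text,))

    return text[len(COMMENT_OPEN):-len(COMMENT_CLOSE)]


# The two byte strings printed in the session above.
for raw in (b"<!--Contents 1-->", b"<!--Contents 2-->"):
    print(strip_comment_markers(raw))
```

In the loop from the answer you would call it as `strip_comment_markers(etree.tostring(node))`. Note that `etree.tostring` includes any text following the node by default; passing `with_tail=False` leaves that out, which this helper relies on when the comment is followed by non-blank text.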

Anzel
  • I ended up going with this approach, but it seems more like a hack to me than a real solution. – RedX Sep 09 '16 at 08:55
  • @RedX, it surely seems like a hack, but it isn't. Consider that a `<!-- -->` block has no proper XML/HTML attributes, so the only way to get at its text content is to render it as is, at least for lxml anyway. – Anzel Sep 09 '16 at 10:09
  • I was expecting to be able to just use something like `contents`, `raw`, `text` or any other function to get the contents. I mean it is just text (AFAIK). – RedX Sep 09 '16 at 10:25
  • @RedX, well, we'll have to agree to disagree. Like I said above, a `comment` element does have a `tag` name property, inherited from its base class as "comment", but apart from this it really doesn't contain any inline attributes or block content. So AFAIK this node has no text content because it's self-closing, which also means its only legitimate property is `tail`: the text between itself and the next element. – Anzel Sep 09 '16 at 10:54