lxml/python reading xml with CDATA section

Question

My XML contains a CDATA section. I want the parser to keep the CDATA section so that I can strip it myself afterwards. Can someone help with the following?

The default parser does not keep it:

$ from io import StringIO
$ from lxml import etree
$ xml = '<Subject> My Subject: 美海軍研究船勘查台海水文？ 船<![CDATA[&#xE9;]]>€ </Subject>'
$ tree = etree.parse(StringIO(xml))
$ tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文？ 船&#xE9;€ '

This post seems to suggest that a parser option strip_cdata=False may keep the cdata, but it has no effect:

$ parser=etree.XMLParser(strip_cdata=False)
$ tree = etree.parse(StringIO(xml), parser=parser)
$ tree.getroot().text    
' My Subject: 美海軍研究船勘查台海水文？ 船&#xE9;€ '

Using strip_cdata=True, which should be the default, yields the same:

$ parser=etree.XMLParser(strip_cdata=True)
$ tree = etree.parse(StringIO(xml), parser=parser)    
$ tree.getroot().text    
' My Subject: 美海軍研究船勘查台海水文？ 船&#xE9;€ '

If you add enough of the relevant XML, we might be able to test. — Jongware, Nov 23 '18 at 23:19
Ah, sorry. It's hard to read, with those numbers before your actual code and data. If they are not an important part of your question, remove them. — Jongware, Nov 23 '18 at 23:28

score 1 · Accepted Answer · answered Nov 24 '18 at 07:02

CDATA sections are not exposed through the text property of an element: text returns only the character content, without the <![CDATA[...]]> markers, even if strip_cdata=False was used when the XML content was parsed, as you have noticed. See https://lxml.de/api.html#cdata.
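
A minimal sketch of that behaviour, assuming a shortened ASCII-only stand-in for the XML in the question (the `xml` string and the variable names here are illustrative, not taken from the question):

```python
from lxml import etree

# Shortened stand-in for the question's document.
xml = "<Subject>caf<![CDATA[&#xE9;]]> au lait</Subject>"

# Default parser: strip_cdata=True.
default_root = etree.fromstring(xml)

# Parser that keeps CDATA sections in the tree.
keep_parser = etree.XMLParser(strip_cdata=False)
keep_root = etree.fromstring(xml, parser=keep_parser)

# Both calls print the same plain string. The CDATA markers never
# appear in .text, and "&#xE9;" stays literal because character
# references are not expanded inside a CDATA section.
print(repr(default_root.text))
print(repr(keep_root.text))
```

So comparing `.text` cannot tell you whether the parser kept the CDATA section; the difference only shows up on serialization.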

Provided the document was parsed with strip_cdata=False, CDATA sections are preserved in these cases:

When serializing with tostring():

print(etree.tostring(tree.getroot(), encoding="UTF-8").decode())

When writing to a file:

tree.write("subject.xml", encoding="UTF-8")
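
The two cases above can be checked side by side. This is only a sketch: the helper `serialize` and the shortened `xml` string are hypothetical, and an in-memory `BytesIO` buffer stands in for `subject.xml` so that nothing is written to disk.

```python
from io import BytesIO
from lxml import etree

# Shortened stand-in for the question's document.
xml = "<Subject>caf<![CDATA[&#xE9;]]> au lait</Subject>"

def serialize(strip_cdata):
    """Parse xml with the given strip_cdata setting and return the
    output of tostring() and of tree.write() as two strings."""
    parser = etree.XMLParser(strip_cdata=strip_cdata)
    tree = etree.parse(BytesIO(xml.encode("utf-8")), parser)
    as_string = etree.tostring(tree.getroot(), encoding="UTF-8").decode()
    buffer = BytesIO()  # stands in for a real file
    tree.write(buffer, encoding="UTF-8")
    return as_string, buffer.getvalue().decode()

kept = serialize(strip_cdata=False)      # CDATA section survives
stripped = serialize(strip_cdata=True)   # content is escaped as plain text

print(kept[0])
print(stripped[0])
```

With the default parser the same content is still there, but it is written as ordinary escaped text (`&amp;#xE9;`) instead of a CDATA section.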

mzjn

Thanks for that. I read that part, but did not realise `etree.tostring` serialises. – Sudipta Basak Nov 24 '18 at 14:26
