Reading CDATA with lxml, problem with end of line

Question

Hello, I am parsing an XML document which contains a bunch of CDATA sections. It was working with no problems until now. I realised that when I read an element and get its text attribute, I get end-of-line characters at the beginning and at the end of the text.

The relevant piece of code is as follows:

for comments in self.xml.iter("Comments"):
    for comment in comments.iter("Comment"):
        description = comment.get('Description')

        if language == "Arab":
            tag = self.name + description
            text = comment.text

The problem is with the Comment element, which is written as follows:

<Comment>
<![CDATA[Usually made it with not reason]]>
</Comment>

When I read the text attribute, I get this:

\nUsually made it with not reason\n

I know that I could call strip and so on, but I would like to fix the problem at its root cause. Maybe there is some option I can set before parsing with ElementTree.

I am parsing the XML file like this:

tree = ET.parse(xml)

Minimal reproducible example

import xml.etree.ElementTree as ET

filename = "test.xml"  # Place the path to your test XML file here

tree = ET.parse(filename)
root = tree.getroot()
Description = root  # <Description> is the root element of the minimal file
text = Description.text

print(text)

Minimal xml file

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Description>
<![CDATA[Hello world]]>
</Description>

Let me check if I can create an isolated minimal example. I have a lot of dependencies. — Jmm86, Jan 12 '21 at 14:49

Tomalak · Accepted Answer · 2021-01-12T16:17:51.420

2

You're getting newline characters because there are newline characters:

<Comment>
<![CDATA[Usually made it with not reason]]>
</Comment>

Why else would <![CDATA and </Comment start on new lines?

If you don't want newline characters, remove them:

<Comment><![CDATA[Usually made it with not reason]]></Comment>

Everything inside an element counts towards its string value.

<![CDATA[...]]> is not an element, it's a parser flag. It changes how the XML parser is reading the enclosed characters. You can have multiple CDATA sections in the same element, switching between "regular mode" and "cdata mode" at will:

<Comment>normal text <![CDATA[
    CDATA mode, this may contain <unescaped> Characters!
]]> now normal text again
<![CDATA[more special text]]> now normal text again
</Comment>

Any newlines before and after a CDATA section count towards the "normal text" section. When the parser reads this, it will create one long string consisting of the individual parts:

normal text 
    CDATA mode, this may contain <unescaped> Characters!
 now normal text again
more special text now normal text again
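
This can be checked with a minimal sketch using the standard-library xml.etree.ElementTree module from the question. The variable names here are only illustrative. It parses the example above and shows that the element has no children, only one text string in which the normal text, the CDATA content, and the surrounding newlines are joined together:

```python
import xml.etree.ElementTree as ET

# The example element from above, with two CDATA sections and normal text.
xml_source = (
    "<Comment>normal text <![CDATA[\n"
    "    CDATA mode, this may contain <unescaped> Characters!\n"
    "]]> now normal text again\n"
    "<![CDATA[more special text]]> now normal text again\n"
    "</Comment>"
)

comment = ET.fromstring(xml_source)
text = comment.text

# CDATA sections do not become child elements; they are merged into .text.
print(len(comment))
# repr() makes the retained newlines visible.
print(repr(text))
```

Printing with repr() rather than print() is a convenient way to see exactly which whitespace characters the parser kept.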

edited Jan 12 '21 at 16:17

answered Jan 12 '21 at 15:51

Tomalak


This is how the XML file was made. I cannot change it, I am only reading it. Anyway, I will use the built-in strip function. – Jmm86 Jan 12 '21 at 16:03

@Jmm86 Yes, if you don't care for the newlines, strip them out. That's XML 101, not every whitespace is significant. But they are all retained, because they *could* be significant. – Tomalak Jan 12 '21 at 16:08

@Jmm86 Also note the addition to the answer. Might explain why you thought the newlines were not there. – Tomalak Jan 12 '21 at 16:08
Thank you so much. – Jmm86 Jan 12 '21 at 16:09

Jmm86 · Answer 2 · 2021-01-12T16:14:04.183

I thought that CDATA sections in XML always came with an end of line at the beginning and at the end, like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Description>
<![CDATA[Hello world]]>
</Description>

But you can also have it like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Description><![CDATA[Hello world]]></Description>

That is the reason we get end-of-line characters when parsing with the ElementTree library. It works correctly in both cases; you only have to strip or not strip depending on how you want to process the data.

If you want to remove both '\n' characters, just add the following code:

text = Description.text
text = text.strip('\n')
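
As a slightly more defensive sketch (the helper name clean_text is hypothetical, not part of ElementTree), the same idea can be wrapped in a function that also copes with empty elements, where .text is None. Calling strip() without arguments removes all leading and trailing whitespace, not only '\n', which may or may not be what you want:

```python
import xml.etree.ElementTree as ET


def clean_text(element):
    """Return the element's text without surrounding whitespace.

    Hypothetical helper: an empty element has .text == None, so fall back
    to an empty string before stripping.
    """
    return (element.text or "").strip()


# Both layouts from above, plus an empty element.
multi_line = ET.fromstring("<Description>\n<![CDATA[Hello world]]>\n</Description>")
single_line = ET.fromstring("<Description><![CDATA[Hello world]]></Description>")
empty = ET.fromstring("<Description/>")

print(repr(multi_line.text))
print(repr(clean_text(multi_line)))
print(repr(clean_text(single_line)))
print(repr(clean_text(empty)))
```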
