How to extract CDATA without the GPath/node name

Question

I'm trying to extract CDATA content from an XML document without using GPath or the node name. In short, I want to find and retrieve the inner text that is wrapped in a CDATA section.

My XML looks like:

def xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
    <Test1>This node contains some innerText. Ignore This.</Test1>
    <Test2><![CDATA[this is the CDATA section i want to retrieve]]></Test2>
</root>'''

From the above XML, I want to get the CDATA content alone, without referring to its node name 'Test2', because the node name is not always the same in my scenario.

Also note that the XML can contain inner text in a few other nodes (such as Test1). I don't want to retrieve that. I just need the CDATA content out of the whole XML.

I want something like the code below (which is incorrect, though):

def parsedXML = new XmlSlurper().parseText(xml)
def cdataContent = parsedXML.depthFirst().findAll { it.text().startsWith('<![CDATA')}

My output should be:

this is the CDATA section i want to retrieve

With the Groovy XML parsers you can't detect CDATA. You have to use DOM or another XML parser. – daggett Sep 17 '18 at 21:37

score 1 · Accepted Answer · answered Sep 18 '18 at 09:17

As @daggett says, you can't do this with the Groovy slurper or parser, but it's not too bad to drop down and use the Java classes to get it.

Note you have to set the property for CDATA to become visible, as by default it's just treated as characters.

Here's the code:

import javax.xml.stream.*

def xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
    <Test1>This node contains some innerText. Ignore This.</Test1>
    <Test2><![CDATA[this is the CDATA section i want to retrieve]]></Test2>
</root>'''

def factory = XMLInputFactory.newInstance()
// Ask the reader to report CDATA sections as their own event type
factory.setProperty('http://java.sun.com/xml/stream/properties/report-cdata-event', true)

def reader = factory.createXMLStreamReader(new StringReader(xml))
while (reader.hasNext()) {
    // Print the text only when the current event is a CDATA section
    if (reader.eventType == XMLStreamConstants.CDATA) {
        println reader.text
    }
    reader.next()
}

That will print: this is the CDATA section i want to retrieve
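
The DOM route mentioned in the comment on the question can be sketched in a similar way. This is only a sketch, assuming the JDK's built-in DOM parser, which keeps CDATA sections as separate CDATASection nodes as long as coalescing is off; collectCdata is a hypothetical helper name, not part of any API.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.CDATASection
import org.xml.sax.InputSource

def xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
    <Test1>This node contains some innerText. Ignore This.</Test1>
    <Test2><![CDATA[this is the CDATA section i want to retrieve]]></Test2>
</root>'''

def factory = DocumentBuilderFactory.newInstance()
// Keep CDATA sections as their own nodes instead of merging them into plain text
factory.coalescing = false

def document = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)))

// Hypothetical helper: walk the tree recursively and collect the data of every CDATA node
def collectCdata
collectCdata = { node, found ->
    if (node instanceof CDATASection) {
        found << node.data
    }
    def children = node.childNodes
    for (int i = 0; i < children.length; i++) {
        collectCdata(children.item(i), found)
    }
    found
}

def cdataContent = collectCdata(document.documentElement, [])
cdataContent.each { println it }
```

Because the walk checks the node type rather than the element name, it does not matter what the enclosing element is called, and ordinary text such as the content of Test1 is skipped.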

This is a perfect and reliable solution. Thanks!! – user1523153 Sep 24 '18 at 15:49

score 1 · Answer 2 · answered Sep 20 '18 at 03:20

Considering you have just one CDATA section in your XML, split can help here:

def xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
    <Test1>This node contains some innerText. Ignore This.</Test1>
    <Test2><![CDATA[this is the CDATA section i want to retrieve]]></Test2>
</root>'''

// log is assumed to be a logger supplied by the host environment; in a plain Groovy script, println works instead
log.info xml.split("<!\\[CDATA\\[")[1].split("]]")[0]

In the logic above, we split the string on the CDATA start marker and pick the portion that follows it:

xml.split("<!\\[CDATA\\[")[1]

Once we have that portion, we split again and take the part before the closing pattern by using:

.split("]]")[0]

Here is the proof it works
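
If the XML may hold more than one CDATA section, the same string-based idea can be sketched with a regular expression instead of split. This is only an illustration: the Test3 element is added here for the example and is not in the original XML, and the approach is just as fragile as any other string manipulation of XML.

```groovy
def xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
    <Test1>This node contains some innerText. Ignore This.</Test1>
    <Test2><![CDATA[this is the CDATA section i want to retrieve]]></Test2>
    <Test3><![CDATA[a second section, added for illustration]]></Test3>
</root>'''

// (?s) lets . match newlines; .*? stops at the first ]]> after each opening marker
def cdataContent = xml.findAll(/(?s)<!\[CDATA\[(.*?)\]\]>/) { full, body -> body }
cdataContent.each { println it }
```

The closure receives the whole match and the captured group, and only the captured group (the text between the markers) is kept.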

Gaurav Khurana

Hmmm... Not a fan of this, string manipulation shouldn't be encouraged to parse XML... – tim_yates Sep 20 '18 at 21:15
True, it's not very logical, but in some custom situations it appears to be the best solution. – Gaurav Khurana Sep 21 '18 at 03:12
