Cleaning CDATA in xml through xslt

Question

I am trying to transform RSS 2 coming from Wordpress into XHTML 1.0 Strict (using a cronjob and xsltproc); however, Wordpress inserts an img into the CDATA at the end of the summary element. The img has a border attribute, which is invalid in XHTML 1.0 Strict. Because it's CDATA, I assume that means I can't match it with my XSLT. I can say for certain that the img is always the last thing before the CDATA ends. I'd prefer to strip the border attr and keep the image, but I'd rather get rid of the element entirely than have invalid markup.

Is it possible to match inside CDATA using XSLT, perhaps using a string expression? If so, is that the right way to go here, or is there a better solution to be had?

First, there would be no elements or attributes, because that CDATA is just unparsed text. **Don't treat parseable data as unparsed data**. Second, every feed reader supports Atom, which treats mixed content properly. — , Mar 04 '11 at 20:25
A similar problem was discussed a while ago: http://stackoverflow.com/questions/5100482/xslt-rss-feed-combine-substring-before-and-substring-after — Alex Nikolaenkov, Mar 04 '11 at 20:42

score 3 · Accepted Answer · answered Mar 04 '11 at 21:14

Remember what CDATA means: "character data". Putting something in CDATA means: this might look like markup, but I don't want you to treat it as markup. So if that thing inside the CDATA looks like an img element, the CDATA is there to tell you not to be fooled - it's not an element at all. Having said that, you can of course process the text in the way you process any other character string, including passing it to an XML parser to be turned into a tree of nodes.
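As a rough sketch of that last point: in an XSLT 3.0 processor (not xsltproc, which implements XSLT 1.0), the standard `parse-xml-fragment()` function can turn the CDATA text into nodes, after which the `border` attribute can be matched like any other attribute. This assumes the escaped HTML sits in `item/description` and is well-formed XML. The element names and the `clean` mode name are illustrative assumptions, not taken from the question's feed, and XHTML namespace handling is left out for brevity.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- In the (hypothetical) "clean" mode, copy every node unchanged
       unless a more specific template matches. -->
  <xsl:mode name="clean" on-no-match="shallow-copy"/>

  <!-- Assumption: each RSS item carries its escaped HTML in <description>. -->
  <xsl:template match="/">
    <div>
      <xsl:apply-templates select="//item/description"/>
    </div>
  </xsl:template>

  <!-- Parse the character data into a node tree, then clean that tree.
       parse-xml-fragment() raises an error if the text is not well-formed. -->
  <xsl:template match="description">
    <div class="summary">
      <xsl:apply-templates select="parse-xml-fragment(string(.))" mode="clean"/>
    </div>
  </xsl:template>

  <!-- Drop the border attribute but keep the img element itself. -->
  <xsl:template match="img/@border" mode="clean"/>

</xsl:stylesheet>
```

If the feed's HTML is not well-formed XML, the parse step fails, so this approach only fits feeds whose embedded markup is already XHTML-like.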

Michael Kay

I wonder whether `saxon:parse()` can handle `CDATA` sections on the fly, or whether they should be preprocessed by some other means? – Alex Nikolaenkov Mar 05 '11 at 05:51
Yes. saxon:parse() was invented to help people dig themselves out of exactly this hole - using CDATA to "hide" markup that should not have been hidden in the first place. – Michael Kay Mar 10 '11 at 14:29

score 1 · Answer 2 · answered Mar 04 '11 at 20:18

A CDATA section is merely a text node; you can match it with a text() template. Then you can use string functions to remove the border attribute from the text.
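A minimal XSLT 1.0 sketch of this idea, which should run under xsltproc, might look like the following. It assumes the escaped HTML is the text of `item/description`, that the attribute is written with double quotes (`border="0"`), and that the processor honours `disable-output-escaping`, which is an optional feature. The `strip-border` template name is made up for illustration.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: each RSS item carries its escaped HTML in <description>. -->
  <xsl:template match="/">
    <div>
      <xsl:apply-templates select="//item/description"/>
    </div>
  </xsl:template>

  <xsl:template match="description">
    <div class="summary">
      <xsl:apply-templates select="text()"/>
    </div>
  </xsl:template>

  <!-- The CDATA content arrives as an ordinary text node. -->
  <xsl:template match="description/text()">
    <xsl:call-template name="strip-border">
      <xsl:with-param name="text" select="."/>
    </xsl:call-template>
  </xsl:template>

  <!-- Recursively remove every  border="..."  substring from $text. -->
  <xsl:template name="strip-border">
    <xsl:param name="text"/>
    <xsl:variable name="marker" select="' border=&quot;'"/>
    <xsl:choose>
      <xsl:when test="contains($text, $marker)">
        <!-- Emit what precedes the attribute as raw markup. -->
        <xsl:value-of select="substring-before($text, $marker)"
                      disable-output-escaping="yes"/>
        <!-- Skip the attribute value up to its closing quote, then recurse. -->
        <xsl:call-template name="strip-border">
          <xsl:with-param name="text"
              select="substring-after(substring-after($text, $marker), '&quot;')"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$text" disable-output-escaping="yes"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Because the text is written out unescaped, the result is only valid XHTML if the markup inside the CDATA section was well-formed to begin with; the string functions do not check that.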

Alexey Ivanov
