Here is my XML:

<doc xmlns="http://www.foo.org">
  <div>
    <title>Mr. Title</title>
    <paragraph>This is one paragraph.
    </paragraph>
    <paragraph>Another paragraph.
    </paragraph>
    <list>
      <orderedlist>
        <item>
          <paragraph>An item paragraph.</paragraph>
        </item>
        <item>
          <paragraph>Another item paragraph</paragraph>
        </item>
      </orderedlist>
    </list>
  </div>    
</doc>

Here is my XSL:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:foo="http://www.foo.org">

<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match="foo:doc">
  <xsl:element name="newdoc" namespace="http://www/w3.org/1999/xhtml">
   <xsl:apply-templates/>
  </xsl:element>
 </xsl:template>

 <xsl:template match="foo:div">
  <segment title="{foo:title}">
   <xsl:apply-templates/>
  </segment>
 </xsl:template>

 <xsl:template match="foo:title">
  <xsl:element name="h2">
   <xsl:apply-templates/>
  </xsl:element>
 </xsl:template>

 <xsl:template match="foo:paragraph">
  <xsl:element name="p">
   <xsl:apply-templates/>
  </xsl:element>
 </xsl:template>

 <xsl:template match="foo:list">
  <xsl:apply-templates/>
 </xsl:template>

 <xsl:template match="foo:orderedlist">
  <xsl:element name="ol">
   <xsl:apply-templates/>
  </xsl:element>
 </xsl:template>

 <xsl:template match="foo:item">
  <xsl:element name="li">
   <xsl:apply-templates/>
  </xsl:element>
 </xsl:template>

 <xsl:template match="foo:item/foo:paragraph">
  <xsl:apply-templates/>
 </xsl:template>

</xsl:stylesheet>

And the output:

<newdoc xmlns="http://www/w3.org/1999/xhtml">
  <segment xmlns="" title="Mr. Title">
    <h2>Mr. Title</h2>
    <p>This is one paragraph.
    </p>
    <p>Another paragraph.
    </p>

      <ol>
        <li>
          An item paragraph.
        </li>

        <li>
          Another item paragraph
        </li>
      </ol>

  </segment>    
</newdoc>

I would like to change 3 things about this output:

  1. remove the line break from the "p" elements (originally paragraph)
  2. remove the line breaks from the "li" elements (produced when item/paragraph elements were removed)
  3. remove the extra blank lines created when the list items were removed

- I have tried <xsl:template match="foo:list/text()[normalize-space(.)='']" /> for #3, but this breaks the indentation.

- I have also tried <xsl:template match="foo:paragraph/text()[normalize-space(.)='']" /> for #1, but this has no effect on the line breaks.

- And I have tried <xsl:strip-space elements="*"/>, but this eliminates all indentation.

Thank you!!

Zori
  • You wrote _"I have tried `<xsl:strip-space elements="*"/>` but this eliminates all indentation"_. That's true **only for input sources**. And by the way, it solves problems 2 and 3. For addressing 1 you need `normalize-space()` as suggested in @Mads Hansen's answer. – Apr 21 '11 at 02:27
  • Good question again, +1. See my answer for a short and easy solution. :) – Dimitre Novatchev Apr 21 '11 at 02:42
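
A minimal sketch of the distinction drawn in the comment above, assuming a processor whose serializer honors `indent="yes"` (XSLT 1.0 only says the processor *may* add whitespace, so results can differ between processors). `xsl:strip-space` works on the source tree, while `xsl:output` only influences how the result is serialized:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Input side: drop whitespace-only text nodes from the source tree -->
  <xsl:strip-space elements="*"/>

  <!-- Output side: allow the serializer to add its own indentation -->
  <xsl:output method="xml" indent="yes"/>

  <!-- Identity template, as in the question's stylesheet -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Text nodes that contain any non-whitespace character, such as paragraph content that ends in a line break, are not touched by `xsl:strip-space`, which is why item 1 still needs separate handling.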

2 Answers

16

Adding these templates to your stylesheet:

<xsl:template match="*/text()[normalize-space()]">
    <xsl:value-of select="normalize-space()"/>
</xsl:template>

<xsl:template match="*/text()[not(normalize-space())]" />

Produces this output:

<?xml version="1.0" encoding="UTF-8"?>
<newdoc xmlns="http://www/w3.org/1999/xhtml">
    <segment xmlns="" xmlns:foo="http://www.foo.org" title="Mr. Title">
        <h2>Mr. Title</h2>
        <p>This is one paragraph.</p>
        <p>Another paragraph.</p>
        <ol>
            <li>An item paragraph.</li>
            <li>Another item paragraph</li>
        </ol>
    </segment>
</newdoc>

The template with match="*/text()[normalize-space()]" matches text() nodes for which normalize-space() returns a non-empty string, and outputs the normalized value (leading and trailing whitespace trimmed, internal runs collapsed to a single space). For a whitespace-only text() node, normalize-space() returns an empty string, which evaluates to false() in the predicate, so that template does not match. The second template matches the opposite condition and, because it is empty, removes whitespace-only text() nodes from the output.
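
For readers who want to try the two templates in isolation, here is a minimal self-contained sketch. It assumes only an identity template and `indent="yes"`; the element-specific templates from the question are left out, and the exact indentation of the result depends on the processor.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Identity template: copies anything not matched more specifically -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text with at least one non-whitespace character:
       emit it trimmed, with internal whitespace runs collapsed -->
  <xsl:template match="*/text()[normalize-space()]">
    <xsl:value-of select="normalize-space()"/>
  </xsl:template>

  <!-- Whitespace-only text: the empty template emits nothing -->
  <xsl:template match="*/text()[not(normalize-space())]"/>

</xsl:stylesheet>
```

Both text() templates have a higher default priority than the identity template because of their predicates, so no explicit priority attribute is needed. One caveat: in mixed content such as `some <b>bold</b> text`, trimming each text node also removes the spaces next to the inline element, so this approach fits best when text and child elements are not interleaved.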

Mads Hansen
  • That fixed #1 for me, thanks! However, I didn't get the same output as you... I still get the line breaks in "li" and blank lines before and after the "ol" elements. It seems that wherever an element was removed, the whitespace around it is left behind. Is there maybe a better way to remove elements that doesn't leave space behind? – Zori Apr 21 '11 at 01:19
  • Hmm, when I ran it through Xselerator using AltovaXML, it generated the output I posted, without the whitespace lines, but I did get them when running it through Saxon. I've added another template that matches whitespace-only `text()` nodes; it produces the posted output and should address your other items. – Mads Hansen Apr 21 '11 at 02:02
  • _"using AltovaXML"_ means that whitespace-only text nodes are stripped, the same as using `xsl:strip-space elements="*"`. – Apr 21 '11 at 02:22
  • Adding the `<xsl:template match="*/text()[not(normalize-space())]" />` line eliminates indenting (throws everything into one line) just as `xsl:strip-space elements="*"` did. BTW, I am using Firefox's XSLT. – Zori Apr 21 '11 at 04:03
  • Scratch that. Looks like Firefox's default processor was taking out the formatting. I installed another processor and all of the answers work great. Sorry for the trouble and thank you for the help!! – Zori Apr 21 '11 at 04:47
  • An interesting side effect of adding this is that (in my case at least) the process now runs much faster (with xsltproc): down from 20s to 0.1s (in my case, a hand-written filter turning 36k lines of HTML into LaTeX). I was sufficiently surprised that I double-checked the timings! – Reuben Thomas Nov 05 '17 at 13:31
  • Can anyone maybe explain how these templates work or point me in the right direction? In particular I do not understand what the `[normalize-space()]` part of the match rule means. – T-Dawg Apr 07 '22 at 10:16
  • @T-Dawg I'll add some context to the answer, but in short the `[]` is a predicate that works somewhat like a `WHERE` clause in SQL: it filters the matched items and keeps only those for which the expression inside evaluates to `true()`. `normalize-space()` collapses runs of whitespace and strips leading/trailing whitespace. If any characters are left, the result evaluates to `true()`; if it is empty (because the text was all whitespace), it evaluates to `false()` and the node is filtered out. – Mads Hansen Apr 07 '22 at 12:45
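
The predicate behaviour described in the last comment can be checked directly. The following is a small sketch, not part of the original answer: a stylesheet that ignores its input and prints how `normalize-space()` results convert to booleans. The sample strings are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Whitespace-only string normalizes to '', which is false -->
    <xsl:value-of select="boolean(normalize-space('  &#xA;  '))"/>
    <xsl:text>&#xA;</xsl:text>
    <!-- Any remaining character makes the string non-empty, hence true -->
    <xsl:value-of select="boolean(normalize-space('  some   text '))"/>
    <xsl:text>&#xA;</xsl:text>
    <!-- The normalized value itself, bracketed to make the trimming visible -->
    <xsl:value-of select="concat('[', normalize-space('  some   text '), ']')"/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Run against any well-formed XML document, this prints `false`, `true`, and `[some text]` on three lines, which is the same test the `[normalize-space()]` and `[not(normalize-space())]` predicates apply to each text node.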
8

At the very end of the stylesheet add these two templates:

<xsl:template match=
"text()[not(string-length(normalize-space()))]"/>

<xsl:template match=
"text()[string-length(normalize-space()) > 0]">
  <xsl:value-of select="translate(.,'&#xA;&#xD;', '  ')"/>
</xsl:template>

You now get the wanted result:

<?xml version="1.0" encoding="UTF-8"?>
<newdoc xmlns="http://www/w3.org/1999/xhtml">
   <segment xmlns="" xmlns:foo="http://www.foo.org" title="Mr. Title">
      <h2>Mr. Title</h2>
      <p>This is one paragraph.         </p>
      <p>Another paragraph.         </p>
      <ol>
         <li>An item paragraph.</li>
         <li>Another item paragraph</li>
      </ol>
   </segment>
</newdoc>
Dimitre Novatchev
  • This removes all indenting (throws everything into one line) as well. Is this because I'm using the Firefox XSLT? – Zori Apr 21 '11 at 03:59
  • @Zori: I have run the transformation with 9 different processors -- none of them loses indentation. I don't have the FF XSLT processor. :( – Dimitre Novatchev Apr 21 '11 at 04:17
  • Sure enough, Firefox's default processor was taking out the formatting. I installed another processor and all of the answers work great. Sorry for the trouble and thank you for the help!! – Zori Apr 21 '11 at 04:45
  • @Zori: Glad that at the end you found the cause of the problem. You could consider now accepting one of the answers :) – Dimitre Novatchev Apr 21 '11 at 12:26