
I have an XML document whose "body" element contains XHTML. I'm trying to process that markup in order to remove some non-standard tags. No namespaces are used in the source XML document.

The XML looks like this:

<article>
  <body>
     <p>Paragraph 1</p>
     <p>Paragraph 2</p>
     <p>Paragraph 3 <fig></fig></p>
  </body>
</article>

The XSLT looks like this:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="p">
  <![CDATA[<div>HIT A P</div>]]>
  <xsl:apply-templates mode="copy" select="@*|node()"/>
</xsl:template>
</xsl:stylesheet>

The output is this - and I don't get why it's only finding the first p tag:

<div>HIT A P</div>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<p>Paragraph 3 <fig></fig></p>

Any idea why the p template only fires for the first paragraph rather than for all 3 paragraphs?

I'm also trying to figure out why adding this isn't causing the "fig" elements to be removed:

<xsl:template match="fig" />

Thanks for taking the time to help me out.

UPDATE: Thank you so much for the reply. I was trying to oversimplify the issue. What I'm really doing is two XSLT processes - one to get the data organized into a standard format and a 2nd XSLT process that looks at the HTML within the body and copies everything except certain non-standard tags.

I think the problem I'm having is that after the first XSLT process, the HTML within the body is htmlencoded, and it seems that the 2nd XSLT process isn't able to transform the HTML. Here's a better example of what is really happening:

This is the new XML (which is the result of an earlier xslt transformation - and as a result the text is encoded):

<document>
    <article>
        <title>SAMPLE TITLE</title>
        <bodytext>
          &lt;p&gt;Paragraph 1&lt;/p&gt;
          &lt;p&gt;Paragraph 2&lt;/p&gt;
          &lt;p&gt;Paragraph 3&lt;/p&gt;
          &lt;p&gt;
          Paragraph 4 - contains non-standard fig tag
          &lt;fig&gt;
          &lt;graphic href="testgraphic.jpg"/&gt;
          &lt;/fig&gt;
          &lt;/p&gt;
        </bodytext>
    </article>
</document>

Here is the new XSLT:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="html" encoding="utf-8" indent="yes"/>

    <xsl:template match="@*|node()">
    <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
    </xsl:template>

    <xsl:template match="p">
    <![CDATA[<div>HIT A P</div>]]>
    <xsl:apply-templates mode="copy" select="@*|node()"/>
    </xsl:template>


  <xsl:template match="bodytext">
      <![CDATA[<div>HELLO FROM BODYTEXT</div>]]>
    <xsl:element name="bodytext">
      <xsl:apply-templates />
    </xsl:element>
   </xsl:template>



    <!-- THIS APPEARS TO NEVER GET HIT -->
    <xsl:template match="fig" />


</xsl:stylesheet>

When I run that, I get the following:

<document>
    <article>
                <title>SAMPLE TITLE</title>

                &lt;div&gt;HELLO FROM BODYTEXT&lt;/div&gt;<bodytext>

                &lt;p&gt;Paragraph 1&lt;/p&gt;
                &lt;p&gt;Paragraph 2&lt;/p&gt;
                &lt;p&gt;Paragraph 3&lt;/p&gt;
                &lt;p&gt;
                Paragraph 4 - contains non-standard fig tag
                &lt;fig&gt;
                &lt;graphic href="testgraphic.jpg"/&gt;
                &lt;/fig&gt;
                &lt;/p&gt;

                </bodytext>
        </article>
</document>

In this example, it isn't able to process each paragraph and remove the fig. However, if the XML isn't htmlencoded, it works. Here's the working XML:

<document>
    <article>
        <title>SAMPLE TITLE</title>
        <bodytext>
            <p>Paragraph 1</p>
            <p>Paragraph 2</p>
            <p>Paragraph 3 <fig></fig></p>
        </bodytext>
    </article>
</document>

And this is the output:

<document>
    <article>
                <title>SAMPLE TITLE</title>

                &lt;div&gt;HELLO FROM BODYTEXT&lt;/div&gt;<bodytext>


        &lt;div&gt;HIT A P&lt;/div&gt;Paragraph 1
     &lt;div&gt;HIT A P&lt;/div&gt;Paragraph 2
     &lt;div&gt;HIT A P&lt;/div&gt;Paragraph 3


                </bodytext>
        </article>
</document>

Do you know how I can do that 2nd process when the incoming data is htmlencoded? Thanks again.

Erich
  • There's nothing strange here. What you show is not *htmlencoded*, but escaped XML. Escaped XML is not XML - see: http://stackoverflow.com/questions/27018244/apply-transforms-to-xml-attribute-containing-escaped-html/27019850#27019850 – michael.hor257k Oct 07 '15 at 19:05
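
The point made in the comment above, that escaped markup reaches the processor as plain text rather than as elements, can be checked with any XML parser. The following is a minimal sketch in Python's standard `xml.etree.ElementTree` (not the XSLT processor from the question); the sample strings and the `count_elements` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Escaped markup: the parser sees one text node inside <bodytext>.
escaped = """<document>
  <article>
    <bodytext>
      &lt;p&gt;Paragraph 1&lt;/p&gt;
      &lt;p&gt;Paragraph 2 &lt;fig&gt;&lt;/fig&gt;&lt;/p&gt;
    </bodytext>
  </article>
</document>"""

# Real markup: the parser sees <p> and <fig> elements.
unescaped = """<document>
  <article>
    <bodytext>
      <p>Paragraph 1</p>
      <p>Paragraph 2 <fig></fig></p>
    </bodytext>
  </article>
</document>"""


def count_elements(xml_source, tag):
    """Count elements named `tag` anywhere in the parsed document."""
    root = ET.fromstring(xml_source)
    return len(list(root.iter(tag)))


escaped_body = ET.fromstring(escaped).find("./article/bodytext")

print(count_elements(escaped, "p"))    # no <p> elements exist in the escaped version
print(count_elements(unescaped, "p"))  # both <p> elements are found here
print(len(escaped_body))               # number of child elements of <bodytext>
```

Because the escaped version contains no `p` or `fig` elements at all, a template with `match="p"` or `match="fig"` has nothing to match.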

1 Answer


Running your XSLT against the input XML you provided, I don't get the output you describe. I get this output:

<article>

   <body>

          &lt;div&gt;HIT A P&lt;/div&gt;
          Paragraph 1

          &lt;div&gt;HIT A P&lt;/div&gt;
          Paragraph 2

          &lt;div&gt;HIT A P&lt;/div&gt;
          Paragraph 3 

   </body>

</article>

which is exactly what your XSLT should be generating.
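
The escaped `&lt;div&gt;` in that output follows from the CDATA section in the stylesheet: a CDATA section is character data, so `<div>HIT A P</div>` becomes literal text and the serializer escapes it when writing the result. The sketch below shows the same parser behaviour using Python's standard `xml.etree.ElementTree`; it is an illustration of how CDATA is read, not a run of the XSLT itself, and the `<out>` wrapper element is made up for the example.

```python
import xml.etree.ElementTree as ET

# A CDATA section is read as plain text, even if it looks like markup.
with_cdata = ET.fromstring("<out><![CDATA[<div>HIT A P</div>]]></out>")

# Without CDATA, the same characters are parsed as a real <div> element.
with_element = ET.fromstring("<out><div>HIT A P</div></out>")

cdata_serialized = ET.tostring(with_cdata, encoding="unicode")
element_serialized = ET.tostring(with_element, encoding="unicode")

print(len(with_cdata), cdata_serialized)      # no child elements; text is escaped on output
print(len(with_element), element_serialized)  # one child element; markup is preserved
```

Writing the `div` as a literal result element in the template, without CDATA, is what produces an actual `<div>` element in the output.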

kjhughes
  • As @michael.hor257k commented on your question, you cannot expect a template match pattern to match against escaped XML, which, by its very nature, tells parsers not to consider it to be XML markup -- that's what ***escaped*** XML means. – kjhughes Oct 07 '15 at 19:21
  • So would the only way around that in PHP be to introduce a separate process to create a new DOMDocument and load the unescaped content into it? – Erich Oct 07 '15 at 20:29
  • I just discovered disable-output-escaping="yes" - that seems to solve the problem by keeping it unescaped. Thank you all for your help. – Erich Oct 07 '15 at 20:43
  • And using copy-of instead of value-of is probably a step in the right direction. – Erich Oct 07 '15 at 20:55
  • I'd add reviewing your purpose for using CDATA to your list of todos. – kjhughes Oct 07 '15 at 21:47
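
The separate-parse idea raised in the comments (load the unescaped content into a new document, then process it as real markup) might look roughly like the sketch below. It uses Python's standard `xml.etree.ElementTree` rather than PHP's DOMDocument, and it assumes the escaped content is well-formed XML once decoded. The helper names `remove_elements` and `unescape_and_strip` and the `<wrapper>` element are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET

source = """<document>
  <article>
    <title>SAMPLE TITLE</title>
    <bodytext>&lt;p&gt;Paragraph 1&lt;/p&gt;&lt;p&gt;Paragraph 4 &lt;fig&gt;&lt;graphic href="testgraphic.jpg"/&gt;&lt;/fig&gt;&lt;/p&gt;</bodytext>
  </article>
</document>"""


def remove_elements(parent, drop_tag):
    """Remove descendants named drop_tag, keeping any text that follows them."""
    previous = None
    for child in list(parent):
        if child.tag == drop_tag:
            # ElementTree stores trailing text on the removed node, so move it first.
            tail = child.tail or ""
            if previous is None:
                parent.text = (parent.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
        else:
            remove_elements(child, drop_tag)
            previous = child


def unescape_and_strip(xml_source, container_tag="bodytext", drop_tag="fig"):
    """Re-parse the escaped text of each container and drop unwanted elements."""
    root = ET.fromstring(xml_source)
    for container in list(root.iter(container_tag)):
        # The parser has already decoded the entities; wrap the text so that
        # several top-level <p> elements parse as a single document.
        fragment = ET.fromstring("<wrapper>" + (container.text or "") + "</wrapper>")
        remove_elements(fragment, drop_tag)
        # Replace the text node with the parsed elements.
        container.text = fragment.text
        container.extend(list(fragment))
    return ET.tostring(root, encoding="unicode")


result = unescape_and_strip(source)
print(result)
```

If the decoded content is not well-formed (for example, HTML with unclosed tags), the inner parse raises an error, so a more tolerant HTML parser would be needed in that case.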