I have a source.xml file with structure like:

<products>
    <product>
        <id>1</id>
        <description>
            <style>
            table{
            some css here
            }
            </style>
            <descr>
            <div>name of producer like ABC&DEF</div>
            <table>
                <th>parameters</th>
                <tr><td>name of param 1 e.g POWER CONSUMPTION</td>
                    <td>value of param 1 with e.g < 100 W</td></tr>
            </table>
            </descr>
        </description>
    </product>
.....................
</products>

I would like to have:

<products>
    <product>
        <id>1</id>
        <description>
        <![CDATA[
            <style>
            table{
            some css here
            }
            </style>
            <descr>
            <div>name of producer like ABC&DEF</div>
            <table>
                <th>parameters</th>
                <tr><td>name of param 1 e.g POWER CONSUMPTION</td>
                    <td>value of param 1 with e.g < 100 VA</td></tr>
            </table>
        ]]>
            </descr>
        </description>
    </product>
.....................
</products>

I tried .xsl stylesheets based on the questions "How to use in XSLT?", "Add CDATA to an xml file", and "how to add cdata to an xml file using xsl", such as:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" encoding="UTF-8" />

<xsl:template match="/products">
    <products>
    <xsl:for-each select="product">
        <product>
            <description>
            <xsl:text disable-output-escaping="yes">&lt;![CDATA[</xsl:text>
            <xsl:copy-of select="description/node()" />    
            <xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>
            </description>
        </product>
    </xsl:for-each>
    </products>
</xsl:template>
</xsl:stylesheet>

and

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
<xsl:output method="xml" indent="yes" cdata-section-elements="description"/>

  <xsl:template match="description">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:variable name="subElementsText">
        <xsl:apply-templates select="node()" mode="asText"/>
      </xsl:variable>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" mode="asText">
    <xsl:copy/>
  </xsl:template>

  <xsl:template match="*" mode="asText">
    <xsl:value-of select="concat('&lt;',name())"/>
    <xsl:for-each select="@*">
      <xsl:value-of select="concat(' ',name(),'=&quot;',.,'&quot;')"/>
    </xsl:for-each>
    <xsl:value-of select="'&gt;'"/>
    <xsl:apply-templates select="node()" mode="asText"/>
    <xsl:value-of select="concat('&lt;/',name(),'&gt;')"/>
  </xsl:template>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

but running my Python script

import lxml.etree as ET

doc = ET.parse('source.xml')
xslt = ET.parse('modyfi.xsl')
transform = ET.XSLT(xslt)
newdoc = transform(doc)
with open(f'output.xml', 'wb') as f:
    f.write(newdoc)

in Sublime Text 3 I always get the same error:

lxml.etree.XMLSyntaxError: StartTag: invalid element name, {number of line and column with first appearance of illegal character}

I am sure the solution is right in front of me in the links above, but I can't see it. Or maybe I can't find it because I can't ask the right question. Please help, I'm new to coding.

JWPB

2 Answers


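The input XML is not well-formed: the literal `&` in `ABC&DEF` and the literal `<` in `< 100 W` have to be escaped as `&amp;` and `&lt;`. I had to fix that first, and it seems to be the reason why the transformation is failing on your end.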
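To see the well-formedness problem in isolation, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree` instead of lxml. The helper name `is_well_formed` and the sample strings are illustrative assumptions, taken from the fragments in the question.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Fragments from the question: a bare '&' and a bare '<' inside text content.
raw_amp = '<div>name of producer like ABC&DEF</div>'
raw_lt = '<td>value of param 1 with e.g < 100 W</td>'

# The same fragments with the special characters escaped.
escaped_amp = '<div>name of producer like ABC&amp;DEF</div>'
escaped_lt = '<td>value of param 1 with e.g &lt; 100 W</td>'

for fragment in (raw_amp, raw_lt, escaped_amp, escaped_lt):
    print(is_well_formed(fragment), fragment)
```

Any XML parser has to reject the two raw fragments, so the failure happens while source.xml is being parsed, before the stylesheet is applied.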

XML

<products>
    <product>
        <id>1</id>
        <description>
            <style>table{
            some css here
            }</style>
            <descr>
                <div>name of producer like ABC&amp;DEF</div>
                <table>
                    <th>parameters</th>
                    <tr>
                        <td>name of param 1 e.g POWER CONSUMPTION</td>
                        <td>value of param 1 with e.g &lt; 100 W</td>
                    </tr>
                </table>
            </descr>
        </description>
    </product>
</products>

XSLT

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="description">
        <xsl:copy>
            <xsl:text disable-output-escaping="yes">&lt;![CDATA[</xsl:text>
            <xsl:copy-of select="*"/>
            <xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

Output

<products>
  <product>
    <id>1</id>
    <description><![CDATA[
      <style>table{
            some css here
            }
      </style>
      <descr>
        <div>name of producer like ABC&amp;DEF</div>
        <table>
          <th>parameters</th>
          <tr>
            <td>name of param 1 e.g POWER CONSUMPTION</td>
            <td>value of param 1 with e.g &lt; 100 W</td>
          </tr>
        </table>
      </descr>]]>
    </description>
  </product>
</products>
Yitzhak Khabinsky
  • Thank you. Does it mean that I can't have characters like '&' and '<' in CDATA section? They must be escaped even here? – JWPB Nov 17 '20 at 21:19
  • @JWPB, without them the input XML is not well-formed. And as such cannot be processed by the XSLT. – Yitzhak Khabinsky Nov 17 '20 at 21:22
  • Understand. So my desired output content posted in question is impossible to obtain. That explains why I fail. – JWPB Nov 17 '20 at 21:54
  • What is the right way to convert my XML file, shown at the beginning of my question, into the form that you presented as fixed? I was able to transform things in my XML file such as `<td>` and `&` into `&lt;td&gt;` and `&amp;`, so now everything inside the `<description>` element is escaped. But now I don't know how to "un-escape" the HTML tag characters while leaving the escaped characters inside the text of those tags. – JWPB Nov 18 '20 at 11:10
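Regarding the last comment, one possible approach is to repair the raw text before parsing rather than escaping everything and un-escaping the tags afterwards. The following is only a rough heuristic sketch: the names `BARE_AMP`, `BARE_LT`, and `escape_stray_markup` are assumptions made for illustration. It treats a `<` as markup whenever a letter, `_`, `/`, `!`, or `?` follows it, so text such as `a<b` would still be misread as a tag, and it leaves anything that looks like an entity reference untouched.

```python
import re
import xml.etree.ElementTree as ET

# '&' that does not start an entity or character reference such as &amp; or &#60;
BARE_AMP = re.compile(r'&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)')
# '<' that is not followed by a character that can start a tag, comment, CDATA, or PI
BARE_LT = re.compile(r'<(?![A-Za-z_/!?])')


def escape_stray_markup(text):
    """Escape bare '&' and '<' in text content, leaving tags and entities alone."""
    text = BARE_AMP.sub('&amp;', text)
    return BARE_LT.sub('&lt;', text)


broken = ('<descr><div>name of producer like ABC&DEF</div>'
          '<td>value of param 1 with e.g < 100 W</td></descr>')
fixed = escape_stray_markup(broken)

# The repaired text now parses; the raw text would raise a ParseError.
root = ET.fromstring(fixed)
print(fixed)
```

Running the repair a second time leaves the text unchanged, because `&amp;` and `&lt;` are already valid references.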

In my view, a clean way is to use a serialize function to turn the elements you want into plain text, then list their parent container in the cdata-section-elements attribute of the xsl:output declaration, and finally make sure the XSLT processor itself performs the serialization, so that the xsl:output settings are applied.

XSLT 3 has a built-in XPath 3.1 serialize function; in Python you could use it with Saxon-C and its Python API.

For libxslt-based XSLT 1.0 with lxml, you can write an extension function in Python and expose it to XSLT:

from lxml import etree as ET

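def serialize(context, nodes):
    # Called from XSLT as mf:serialize(...); the first argument is the XPath context.
    # Assumes every selected node is an element that ET.tostring() can serialize.
    return b''.join(ET.tostring(node) for node in nodes)


# Register the function in the namespace that the stylesheet binds to the mf prefix.
ns = ET.FunctionNamespace('http://example.com/mf')
ns['serialize'] = serialize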

xml = ET.fromstring('<root><div><p>foo</p><p>bar</p></div></root>')

xsl = ET.fromstring('''<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:mf="http://example.com/mf" version="1.0">
  <xsl:output method="xml" cdata-section-elements="div" encoding="UTF-8"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
       <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="div">
    <xsl:copy>
      <xsl:value-of select="mf:serialize(node())"/>
    </xsl:copy>
 </xsl:template>
</xsl:stylesheet>''')

transform = ET.XSLT(xsl)

result = transform(xml)

result.write_output("transformed.xml")

The output is then:

<?xml version="1.0" encoding="UTF-8"?>
<root><div><![CDATA[<p>foo</p><p>bar</p>]]></div></root>
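For comparison, the same "serialize the children, then wrap them in CDATA" idea can be sketched without XSLT, using only the standard library's `xml.dom.minidom`. This is a swapped-in technique, not the lxml extension-function approach above; the function name `wrap_children_in_cdata` is hypothetical, and the sketch assumes that elements with the given tag are not nested inside each other and that the serialized content does not contain `]]>`.

```python
from xml.dom import minidom


def wrap_children_in_cdata(xml_text, tag):
    """Replace the children of every <tag> element with one CDATA section
    holding their serialized markup (illustrative sketch)."""
    doc = minidom.parseString(xml_text)
    for element in doc.getElementsByTagName(tag):
        # Serialize the existing children to markup text.
        markup = ''.join(child.toxml() for child in element.childNodes)
        # Remove the original children from the element.
        while element.firstChild:
            element.removeChild(element.firstChild)
        # Put the markup back as a single CDATA section.
        element.appendChild(doc.createCDATASection(markup))
    return doc.documentElement.toxml()


source = '<root><div><p>foo</p><p>bar</p></div></root>'
print(wrap_children_in_cdata(source, 'div'))
```

As in the XSLT version, the input still has to be well-formed, and characters that were escaped in the source text (such as `&lt;`) stay escaped inside the CDATA section.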
Martin Honnen
  • Thank you @Martin Honnen. I think I need a few days to understand what you wrote. I will try. – JWPB Nov 17 '20 at 21:52