
Using XSLT 2.0, how can I dynamically process an XML document to remove nodes that have the xml:lang attribute, based on the rules/requirements below?

Requirements:

  • Find any node (and its same-type immediate siblings) with attribute xml:lang
  • Know that xml:lang values have a 3-tier hierarchy based on language/locale, with non-exhaustive examples below:
    1. x-default (tier 1, highest)
    2. en (tier 2, language prefix, other value examples: fr, es, ru)
    3. en-US (tier 3, language prefix followed by suffix, other value examples: en-GB, en-CA)
  • Based on the known hierarchy, duplicate values should be removed.
  • When removing duplicates, also take into consideration the existence of additional attributes that a sibling may have.
  • Leave the rest of the XML document unmolested

Example dataset:

<?xml version="1.0" encoding="UTF-8"?>
<arbitrarydepth>
<scenario1 xml:lang="x-default">A Default Node Value</scenario1>
<scenario1 xml:lang="en">A Default Node Value</scenario1>
<scenario1 xml:lang="en-US">A Default Node Value</scenario1>

<scenario2 xml:lang="x-default">The orig value</scenario2>
<scenario2 xml:lang="en">The orig value</scenario2>
<scenario2 xml:lang="en-US">A new value</scenario2>

<scenario3 xml:lang="x-default">The orig value</scenario3>
<scenario3 xml:lang="en">A new value</scenario3>
<scenario3 xml:lang="en-US">The orig value</scenario3>

<scenario4 xml:lang="x-default">The orig value</scenario4>
<scenario4 xml:lang="en">An english value</scenario4>
<scenario4 xml:lang="en-US">An english US value</scenario4>
<scenario4 xml:lang="fr">A french value</scenario4>
<scenario4 xml:lang="fr-FR">A french value</scenario4>
<scenario4 xml:lang="fr-CA">A french Canada value</scenario4>

<scenario5 xml:lang="x-default" attr0="something here">The orig value</scenario5>
<scenario5 xml:lang="en" attr1="Some attribute">The orig value</scenario5>
<scenario5 xml:lang="en-US" attr2="some other attribute">The orig value</scenario5>
<scenario5 xml:lang="fr" attr0="something here">The orig value</scenario5>
<scenario5 xml:lang="fr-FR">The orig value</scenario5>
</arbitrarydepth>

Example resultset:

<?xml version="1.0" encoding="UTF-8"?>
<arbitrarydepth>
<scenario1 xml:lang="x-default">A Default Node Value</scenario1>

<scenario2 xml:lang="x-default">The orig value</scenario2>
<scenario2 xml:lang="en-US">A new value</scenario2>

<scenario3 xml:lang="x-default">The orig value</scenario3>
<scenario3 xml:lang="en">A new value</scenario3>
<scenario3 xml:lang="en-US">The orig value</scenario3>

<scenario4 xml:lang="x-default">The orig value</scenario4>
<scenario4 xml:lang="en">An english value</scenario4>
<scenario4 xml:lang="en-US">An english US value</scenario4>
<scenario4 xml:lang="fr">A french value</scenario4>
<scenario4 xml:lang="fr-CA">A french Canada value</scenario4>

<scenario5 xml:lang="x-default" attr0="something here">The orig value</scenario5>
<scenario5 xml:lang="en" attr1="Some attribute">The orig value</scenario5>
<scenario5 xml:lang="en-US" attr2="some other attribute">The orig value</scenario5>
</arbitrarydepth>
Jon L.
  • What's the question? What have you tried and where are you stuck? – Daniel Haley Nov 13 '14 at 17:11
  • @DanielHaley, I've updated the post to ask the question. I don't have any clue how to actually achieve this dynamically. An answer I asked awhile back, has an answer that removes duplicates, but is not XSLT 2.0, and does not consider the hierarchy that I described. Prior answer: http://stackoverflow.com/a/26290199/441739 – Jon L. Nov 13 '14 at 17:19
  • Sorry, I meant, "A question I asked awhile back, has an answer..." – Jon L. Nov 13 '14 at 17:25
  • What does "it's same-type immediate siblings" refer to, to sibling elements of the same name, e.g. `scenario1` element siblings? Is there always an element with `xml:lang="x-default"` starting a "group"? – Martin Honnen Nov 13 '14 at 17:35
  • @MartinHonnen, correct, `scenario1` would only consider other `scenario1` nodes as siblings, and having a common parent. I expect the `x-default` entry to always exist within a group, but the actual order of elements in a grouping is not guaranteed. – Jon L. Nov 13 '14 at 17:42
  • Why is `The orig value` not removed from your expected output? – Mathias Müller Nov 13 '14 at 17:45
  • @MathiasMüller, because based on the hierarchy here, `en-US` is compared against `en`, and because it differs, it's not removed – Jon L. Nov 13 '14 at 17:47

1 Answer

This should fulfill all requirements, except the last one about matching dynamic attributes:

<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:template match="*">
    <xsl:variable name="elementName" select="name()"/>
    <xsl:variable name="contentText" select="normalize-space(.)"/>
    <xsl:choose>
        <xsl:when test="not(@xml:lang)">
            <!-- Non-lang element -->
            <xsl:copy>
                <xsl:copy-of select="@*"/>
                <!-- Recurse into all children (not just elements) so text in non-lang elements is kept -->
                <xsl:apply-templates select="node()"/>
            </xsl:copy>
        </xsl:when>
        <xsl:when test="@xml:lang='x-default'">
            <!-- Tier 1: xml:lang="x-default" -->
            <xsl:copy-of select="."/>
        </xsl:when>
        <xsl:when test="contains(@xml:lang,'-')">
            <!-- Tier 3: xml:lang="en-US" -->
            <xsl:variable name="baselang" select="substring-before(@xml:lang, '-')"/>
            <xsl:choose>
                <xsl:when test="../*[name()=$elementName][@xml:lang=$baselang][normalize-space(.)=$contentText]">
                    <!-- Same text as Tier 2 parent -->
                </xsl:when>
                <xsl:when test="../*[name()=$elementName][@xml:lang=$baselang]">
                    <xsl:copy-of select="."/>
                </xsl:when>
                <xsl:when test="../*[name()=$elementName][@xml:lang='x-default'][normalize-space(.)=$contentText]">
                    <!-- Same text as Tier 1 parent -->
                </xsl:when>
                <xsl:when test="../*[name()=$elementName][@xml:lang='x-default']">
                    <xsl:copy-of select="."/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:copy-of select="."/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:when>
        <xsl:otherwise>
            <!-- Tier 2: xml:lang="en" -->
            <xsl:choose>
                <xsl:when test="../*[name()=$elementName][@xml:lang='x-default'][normalize-space(.)=$contentText]">
                    <!-- Same text as Tier 1 parent -->
                </xsl:when>
                <xsl:when test="../*[name()=$elementName][@xml:lang='x-default']">
                    <xsl:copy-of select="."/>
                </xsl:when>
                <xsl:otherwise>
                    <!-- No matching parent -->
                    <xsl:copy-of select="."/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>
</xsl:stylesheet>

Demo: http://www.xsltcake.com/slices/uopn40

Matching dynamic attributes between parent and child is actually very complex. You have to loop through the child's attributes and compare each one against the current parent (the higher-tier element). If any attribute is missing on the parent, or if its value is different, you have to keep the new element.

To fulfill the last requirement, I think you have to move to an imperative language (C#, JavaScript, Java).
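
That said, here is an untested sketch of how the attribute comparison described above might look as an XSLT 2.0 stylesheet function. The name `my:attributes-covered` and the `urn:example:lang-dedupe` namespace are made up for illustration, and the sketch assumes that xml:lang itself is excluded from the comparison. I have not run it against the demo data.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:lang-dedupe"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical helper: returns true when every attribute of $child,
       other than xml:lang, also exists on $parent with the same name
       and the same value. A child with no other attributes yields true. -->
  <xsl:function name="my:attributes-covered" as="xs:boolean">
    <xsl:param name="child" as="element()"/>
    <xsl:param name="parent" as="element()"/>
    <xsl:sequence select="
      every $a in $child/(@* except @xml:lang) satisfies
        (some $p in $parent/@* satisfies
           node-name($p) eq node-name($a) and string($p) eq string($a))"/>
  </xsl:function>

</xsl:stylesheet>
```

If it behaves as intended, each "same text as parent" test in the stylesheet above could take one more predicate, for example `[my:attributes-covered(current(), .)]`, so that an element is dropped only when both its text and its extra attributes are already present on the higher-tier element. That would require the stylesheet to declare version="2.0" and the extra namespaces.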

Markus Jarderot
  • "*To fulfill the last requirement, I think you have to move to an imperative language*" Certainly not. XSLT (even XSLT 1.0) is Turing-complete. – michael.hor257k Nov 13 '14 at 21:08
  • @michael.hor257k, mind fulfilling that last requirement? – Jon L. Nov 13 '14 at 21:22
  • @markus-jarderot, thanks for that demo, it does indeed appear to solve all requirements except the last. I'll give it a go on a larger dataset later to see how it fares. – Jon L. Nov 13 '14 at 21:23
  • @JonL. I think I have outlined how this needs to be approached; the rest is work. – michael.hor257k Nov 13 '14 at 21:24