Hi, I require help transforming the following XML:

<xmeml>
<Doc>
    <Test>
        <Unit>abc</Unit>
        <Unit2>1234</Unit2>
    </Test>
    <Test>
        <Unit>bcd</Unit>
        <Unit2>2345</Unit2>
    </Test>
</Doc>
<Doc>
    <Test>
        <Unit>abc</Unit>
        <Unit2>3456</Unit2>
    </Test>
    <Test>
        <Unit>cde</Unit>
        <Unit2>3456</Unit2>
    </Test> 
</Doc>
<Doc>
    <Test>
        <Unit>abc</Unit>
        <Unit2>1234</Unit2>
    </Test>
    <Test>
        <Unit>def</Unit>
        <Unit2>4567</Unit2>
    </Test> 
</Doc>
<Doc>
    <Test>
        <Unit>abc</Unit>
        <Unit2>1234</Unit2>
    </Test>
    <Test>
        <Unit>efg</Unit>
        <Unit2>2345</Unit2>
    </Test> 
</Doc>
</xmeml>

so that I end up with the following:

<xmeml>
<Doc>
    <Test>
        <Unit>bcd</Unit>
        <Unit2>2345</Unit2>
    </Test>
</Doc>
<Doc>
    <Test>
        <Unit>abc</Unit>
        <Unit2>3456</Unit2>
    </Test>
    <Test>
        <Unit>cde</Unit>
        <Unit2>3456</Unit2>
    </Test> 
</Doc>
<Doc>
    <Test>
        <Unit>def</Unit>
        <Unit2>4567</Unit2>
    </Test> 
</Doc>
<Doc>
    <Test>
        <Unit>abc</Unit>
        <Unit2>1234</Unit2>
    </Test>
    <Test>
        <Unit>efg</Unit>
        <Unit2>2345</Unit2>
    </Test> 
</Doc>
</xmeml>

I am attempting to create an XSLT document to do this, but have not yet found one that works. I should note that the values to be matched within 'Doc' are, in this case, "abc" and "1234". In the real world these vary and will never be a static, searchable value.

So, in English, my XSL would do this: for any 'Test' whose 'Unit' and 'Unit2' values both match those of another 'Test', delete all the preceding 'Test' elements containing that duplicate pair of 'Unit' and 'Unit2' values, keeping only the last.

All help is most appreciated. Thanks.

user1540142

3 Answers


Here's a relatively simple way of doing it, although I'm fairly sure there's a more efficient way using the Muenchian method. If performance isn't an issue, however, this is probably easier to follow:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml"/>

  <xsl:template match="Test">
    <xsl:variable name="vUnit" select="Unit" />
    <xsl:variable name="vUnit2" select="Unit2" />
    <xsl:if test="not(following::Test[Unit = $vUnit and Unit2 = $vUnit2])">
      <xsl:call-template name="identity" />
    </xsl:if>
  </xsl:template>

  <xsl:template match="@* | node()" name="identity">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

The Test template simply checks if there's a later Test element with the same values in Unit and Unit2, and if there isn't, it outputs it as normal.

Flynn1179
  • Thanks Flynn. Perfect Score for you. – user1540142 Jul 21 '12 at 11:35
  • Hi Flynn. How would I adapt the above if I wanted to add another match condition, i.e. search for duplicates of Unit and Unit2 but also look for a match where, say, Unit3 is a known value of 4567? We'll have to imagine Unit3 is also present in all the above Test nodes. – user1540142 Jul 22 '12 at 00:08
  • That's fairly easy, just add `and Unit3 = 4567` into the `Test` predicate in the `test` attribute of `xsl:if` – Flynn1179 Jul 22 '12 at 00:34
  • Thanks again, Flynn. How about Unit3 starting with 456? Something like `and Unit3[starts-with(.,'456')]`? – user1540142 Jul 22 '12 at 00:42
  • Pretty much exactly like that, although personally, I'd do `starts-with(Unit3, '456')`, but it's essentially the same. – Flynn1179 Jul 22 '12 at 00:59
  • Flynn, do I have to define Unit3 as a variable as well, i.e. and then add `starts-with($vUnit3, '456')`? – user1540142 Jul 22 '12 at 01:28
  • Hi Flynn, sorry, don't worry, I asked a more specific question. Thanks though, I'm using your first result as well. – user1540142 Jul 22 '12 at 03:02
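
The adaptation discussed in the comments above might look like the following sketch. It assumes, as the comments do, that a hypothetical `Unit3` child exists in every `Test`; that element is not part of the sample document. Note that the extra condition sits inside the `following::Test[...]` predicate, so it is evaluated against the later `Test`, which is why no additional variable is needed. A `$vUnit3` variable would only be required if the current element's own `Unit3` had to be compared with that of the later one.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml"/>

  <xsl:template match="Test">
    <xsl:variable name="vUnit" select="Unit" />
    <xsl:variable name="vUnit2" select="Unit2" />
    <!-- Drop this Test only when a later Test has the same Unit and Unit2
         AND that later Test's (hypothetical) Unit3 starts with '456' -->
    <xsl:if test="not(following::Test[Unit = $vUnit and Unit2 = $vUnit2
                                       and starts-with(Unit3, '456')])">
      <xsl:call-template name="identity" />
    </xsl:if>
  </xsl:template>

  <!-- Identity rule: copy everything else unchanged -->
  <xsl:template match="@* | node()" name="identity">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

For the exact-value case from the earlier comment, `starts-with(Unit3, '456')` would be replaced by `Unit3 = 4567`.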

Many problems involving elimination of duplicates can be tackled in XSLT 2.0 using the for-each-group construct. In this case, the solution using for-each-group isn't obvious, because it's not actually a grouping problem (with grouping problems, we are generally producing one element in the output that corresponds to a group of elements in the input, and that is not the case here.) I would tackle it the same way as Dimitre: use for-each-group to identify the groups, and hence the Test elements that need to be retained versus those that need to be deleted. In fact I started solving this and came up with a solution that was very similar to Dimitre's, except that I think the last template rule can be simplified to

<xsl:template match="Test[not(. intersect $vLastInGroup)]"/>

It's an example of a coding pattern I sometimes use where you set up node-set-valued global variables containing all the elements with a particular characteristic, and then use template rules that test for membership of the global node-set (using the predicate [. intersect $node-set]). Following this pattern, and using some new syntax available in XSLT 3.0, I would tend to write the code like this:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>
 <xsl:mode on-no-match="shallow-copy"/>

 <xsl:variable name="deletedElements" as="element()*">
  <xsl:for-each-group select="/*/Doc/Test"
                      group-by="Unit, Unit2" composite="yes">
   <xsl:sequence select="current-group()[position() ne last()]"/>
  </xsl:for-each-group>
 </xsl:variable>

 <xsl:template match="$deletedElements"/>
</xsl:stylesheet>
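
For a processor limited to XSLT 2.0, the same membership-test pattern could be assembled roughly as below. This is a sketch that combines the `$vLastInGroup` variable from Dimitre's for-each-group solution with the simplified `intersect` rule quoted above; it is not code posted by either author in this form. Because the variable is declared with `as="element()*"` and filled with `xsl:sequence`, it holds the original nodes rather than copies, which is what makes the identity-based `intersect` test work.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <!-- The last Test of each (Unit, Unit2) group: the elements to keep -->
 <xsl:variable name="vLastInGroup" as="element()*">
  <xsl:for-each-group select="/*/Doc/Test"
                      group-by="concat(Unit, '+', Unit2)">
   <xsl:sequence select="current-group()[last()]"/>
  </xsl:for-each-group>
 </xsl:variable>

 <!-- Identity rule: copy everything unless a more specific rule matches -->
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <!-- Suppress any Test that is not a member of $vLastInGroup -->
 <xsl:template match="Test[not(. intersect $vLastInGroup)]"/>
</xsl:stylesheet>
```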
Michael Kay

I. XSLT 1.0 Solution:

Here is a simple (no variables, no xsl:if, no axes, no xsl:call-template) application of the most efficient known XSLT 1.0 grouping method -- Muenchian grouping:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:key name="kTestByData" match="Test"
  use="concat(Unit, '|', Unit2)"/>

 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <xsl:template match=
  "Test[not(generate-id()
           = generate-id(key('kTestByData',concat(Unit, '|', Unit2))[last()])
            )]"/>
</xsl:stylesheet>

When this transformation is applied on the provided XML document:

<xmeml>
    <Doc>
        <Test>
            <Unit>abc</Unit>
            <Unit2>1234</Unit2>
        </Test>
        <Test>
            <Unit>bcd</Unit>
            <Unit2>2345</Unit2>
        </Test>
    </Doc>
    <Doc>
        <Test>
            <Unit>abc</Unit>
            <Unit2>3456</Unit2>
        </Test>
        <Test>
            <Unit>cde</Unit>
            <Unit2>3456</Unit2>
        </Test>
    </Doc>
    <Doc>
        <Test>
            <Unit>abc</Unit>
            <Unit2>1234</Unit2>
        </Test>
        <Test>
            <Unit>def</Unit>
            <Unit2>4567</Unit2>
        </Test>
    </Doc>
    <Doc>
        <Test>
            <Unit>abc</Unit>
            <Unit2>1234</Unit2>
        </Test>
        <Test>
            <Unit>efg</Unit>
            <Unit2>2345</Unit2>
        </Test>
    </Doc>
</xmeml>

the wanted, correct result is produced:

<xmeml>
   <Doc>
      <Test>
         <Unit>bcd</Unit>
         <Unit2>2345</Unit2>
      </Test>
   </Doc>
   <Doc>
      <Test>
         <Unit>abc</Unit>
         <Unit2>3456</Unit2>
      </Test>
      <Test>
         <Unit>cde</Unit>
         <Unit2>3456</Unit2>
      </Test>
   </Doc>
   <Doc>
      <Test>
         <Unit>def</Unit>
         <Unit2>4567</Unit2>
      </Test>
   </Doc>
   <Doc>
      <Test>
         <Unit>abc</Unit>
         <Unit2>1234</Unit2>
      </Test>
      <Test>
         <Unit>efg</Unit>
         <Unit2>2345</Unit2>
      </Test>
   </Doc>
</xmeml>

Do note: for node-sets with a large number of nodes to be de-duplicated, the Muenchian grouping method can be orders of magnitude faster than the quadratic (O(N^2)) sibling-comparison grouping.
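
One detail of the key used above is worth spelling out: the `'|'` separator in `concat(Unit, '|', Unit2)` keeps two different pairs from collapsing into the same key string. The fragment below is a made-up illustration (these values do not occur in the sample document), and it assumes the separator character never appears inside the `Unit` or `Unit2` values themselves.

```xml
<Doc>
    <!-- concat(Unit, Unit2) would give "abc1234" for BOTH elements below,
         so they would wrongly be treated as duplicates.
         concat(Unit, '|', Unit2) gives "abc|1234" and "abc1|234",
         which keeps them distinct. -->
    <Test>
        <Unit>abc</Unit>
        <Unit2>1234</Unit2>
    </Test>
    <Test>
        <Unit>abc1</Unit>
        <Unit2>234</Unit2>
    </Test>
</Doc>
```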


II. XSLT 2.0 solutions:

II.1 Here is a simple XSLT 2.0 solution (inefficient, and suitable only for small node-sets):

<xsl:stylesheet version="2.0"   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <xsl:template match=
  "Test[concat(Unit, '+', Unit2) = following::Test/concat(Unit, '+', Unit2)]"/>
</xsl:stylesheet>

II.2 An efficient solution using xsl:for-each-group:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vLastInGroup" as="element()*">
  <xsl:for-each-group select="/*/Doc/Test"
       group-by="concat(Unit, '+', Unit2)">
   <xsl:sequence select="current-group()[last()]"/>
  </xsl:for-each-group>
 </xsl:variable>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match=
 "Test[for $t in .
        return
         not($vLastInGroup[. is $t])
      ]"/>
</xsl:stylesheet>
Dimitre Novatchev
  • 'Simple' is a relative concept. To an expert user such as yourself I'd agree that this is a simpler solution, but to the beginner/novice in XSLT, Muenchian grouping's actually quite difficult to get to grips with. Although it's not the most efficient (as I already said), checking for the absence of following siblings is much easier to understand, given that it's conceptually similar to the description of the solution needed. However, if you need the speed and you can understand it, I'd agree this solution's a better option. – Flynn1179 Jul 21 '12 at 15:32
  • @Flynn1179: As for understandability -- I agree with you. However your transformation is unnecessarily complex and this actually makes it more difficult to understand. With this comment I am actually contributing to making your solution simpler and more elegant. – Dimitre Novatchev Jul 21 '12 at 15:38
  • @Flynn1179: There is some objective measure of understandability -- simply put, the number of "moving parts". If one solution has one `xsl:if`, two `xsl:variable`, one `following` axis, one `xsl:call-template` -- and the other has none of these, then which one is simpler? – Dimitre Novatchev Jul 21 '12 at 16:41
  • @MichaelKay, it is surprising to me that it is so difficult and complicated to solve this problem using `xsl:for-each-group` -- compared to the first, Muenchian solution. Can you suggest a better solution? – Dimitre Novatchev Jul 21 '12 at 17:09
  • Thanks again, Dimitre. Re the XSLT 1.0 solution, sorry to do this to you, but what would I do if I also wanted to add a third item to the key, something to the effect of Unit3 starts-with '345'? We'll have to pretend that Unit3 is present in each Test node as well. – user1540142 Jul 22 '12 at 02:03
  • Actually, Dimitre, this is a different question again, so disregard the prior comment. Thanks anyway :) – user1540142 Jul 22 '12 at 02:07
  • @user1540142: your comments are incomprehensible -- please, start a new question and explain thoroughly. – Dimitre Novatchev Jul 22 '12 at 03:04