
To help reverse-engineer XML files, I'm using the Python SAX handler shown below. Can somebody provide an equivalent XSLT that does the same job? This is an example input file:

<beatles>
  <beatle>
    <name>
      <first>John</first>
      <last>Lennon</last>
    </name>
  </beatle>
  <beatle>
    <name>
      <first>Paul</first>
      <last>McCartney</last>
    </name>
  </beatle>
  <beatle>
    <name>
      <first>George</first>
      <last>Harrison</last>
    </name>
  </beatle>
  <beatle>
    <name>
      <first>Ringo</first>
      <last>Starr</last>
    </name>
  </beatle>
</beatles>

The idea is to get a list of all unique element paths (ignoring attributes) as a basic starting point for writing templates, etc.

from xml.sax.handler import ContentHandler
from xml.sax import make_parser

class ShowPaths(ContentHandler):
    """Print every distinct element path, in order of first appearance."""

    def startDocument(self):
        self.unique_paths = []   # paths already seen, in document order
        self.current_path = []   # stack of currently open element names

    def startElement(self, name, attrs):
        self.current_path.append(name)
        path = "/".join(self.current_path)
        if path not in self.unique_paths:
            self.unique_paths.append(path)

    def endElement(self, name):
        self.current_path.pop()

    def endDocument(self):
        for path in self.unique_paths:
            print(path)

if __name__ == '__main__':
    handler = ShowPaths()
    saxparser = make_parser()
    saxparser.setContentHandler(handler)
    # Binary mode lets the parser honour the encoding declared in the file.
    with open("d:\\beatles.xml", "rb") as in_f:
        saxparser.parse(in_f)

And the result of running the program over the example:

beatles
beatles/beatle
beatles/beatle/name
beatles/beatle/name/first
beatles/beatle/name/last
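
For anyone who wants to try this without a file on disk, here is a minimal sketch of the same idea. It is not the script above: it assumes the XML arrives as a string, the names `UniquePaths`, `unique_paths` and `SAMPLE` are made up for illustration, and it keeps the paths as dict keys (insertion-ordered in Python 3.7+) so that the "seen before?" check does not rescan a list.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical sample data, shortened from the example input above.
SAMPLE = """<beatles>
  <beatle><name><first>John</first><last>Lennon</last></name></beatle>
  <beatle><name><first>Paul</first><last>McCartney</last></name></beatle>
</beatles>"""


class UniquePaths(ContentHandler):
    """Collect each distinct element path once, in first-seen order."""

    def startDocument(self):
        self.paths = {}   # dict keys act as an ordered set
        self.stack = []   # names of the currently open elements

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.paths.setdefault("/".join(self.stack), None)

    def endElement(self, name):
        self.stack.pop()


def unique_paths(xml_text):
    """Return the distinct element paths found in xml_text."""
    handler = UniquePaths()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return list(handler.paths)


if __name__ == "__main__":
    print("\n".join(unique_paths(SAMPLE)))
```

Attributes are ignored here as well, since only the element name is pushed onto the stack.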
monojohnny
  • Bear with me: I can't get this to format correctly (the slash-separated paths are being joined on a single line, and the XML is actually being processed and displayed as text?) – monojohnny Apr 20 '11 at 11:02
  • Can't get the XML to display literally in the post: so here it is John Lennon Paul McCartney George Harrison Ringo Starr – monojohnny Apr 20 '11 at 11:06
  • Good question, +1. See my answer for a complete, short and easy XSLT solution. :) – Dimitre Novatchev Apr 20 '11 at 12:51
  • I recently answered this in a related question. Here: http://stackoverflow.com/questions/5695964/output-context-node-full-path-in-xslt-1-0/5705457#5705457 – Wayne Apr 20 '11 at 17:54
  • Thanks for that (I upvoted that post) - it was a related question I had in fact ! (One thing led to another...) – monojohnny Apr 20 '11 at 22:47
  • @monojohnny: For distinction, you need to know all the key values in advance. That breaks the conditions for streaming. –  Apr 20 '11 at 23:03

2 Answers


So the idea is to get a list of all unique paths (ignoring attributes) to get a basic starting point to writing templates etc

This is easy:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="*">
        <xsl:apply-templates select="ancestor-or-self::*" mode="path"/>
        <xsl:text>&#xA;</xsl:text>
        <xsl:apply-templates/>
 </xsl:template>

 <xsl:template match="*" mode="path">
  <xsl:value-of select="concat('/',name())"/>

  <xsl:variable name="vnumPrecSiblings" select=
        "count(preceding-sibling::*[name()=name(current())])"/>
  <xsl:variable name="vnumFollSiblings" select=
        "count(following-sibling::*[name()=name(current())])"/>

  <xsl:if test="$vnumPrecSiblings or $vnumFollSiblings">
   <xsl:value-of select=
     "concat('[', $vnumPrecSiblings +1, ']')"/>
  </xsl:if>
 </xsl:template>

 <xsl:template match="text()"/>
</xsl:stylesheet>

When this transformation is applied to the provided XML document:

<beatles>
    <beatle>
        <name>
            <first>John</first>
            <last>Lennon</last>
        </name>
    </beatle>
    <beatle>
        <name>
            <first>Paul</first>
            <last>McCartney</last>
        </name>
    </beatle>
    <beatle>
        <name>
            <first>George</first>
            <last>Harrison</last>
        </name>
    </beatle>
    <beatle>
        <name>
            <first>Ringo</first>
            <last>Starr</last>
        </name>
    </beatle>
</beatles>

the wanted, correct result is produced:

/beatles
/beatles/beatle[1]
/beatles/beatle[1]/name
/beatles/beatle[1]/name/first
/beatles/beatle[1]/name/last
/beatles/beatle[2]
/beatles/beatle[2]/name
/beatles/beatle[2]/name/first
/beatles/beatle[2]/name/last
/beatles/beatle[3]
/beatles/beatle[3]/name
/beatles/beatle[3]/name/first
/beatles/beatle[3]/name/last
/beatles/beatle[4]
/beatles/beatle[4]/name
/beatles/beatle[4]/name/first
/beatles/beatle[4]/name/last
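
For comparison with the question's Python, the positional-path rule used by the `mode="path"` template (append `[n]` only when an element has same-named siblings) can be sketched as follows. This is an illustration, not a port of the stylesheet: it assumes `xml.etree.ElementTree`, the names `positional_paths` and `SAMPLE` are invented here, and for namespaced documents `tag` would carry a `{uri}` prefix where `name()` would not.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data, shortened from the document above.
SAMPLE = """<beatles>
  <beatle><name><first>John</first><last>Lennon</last></name></beatle>
  <beatle><name><first>Paul</first><last>McCartney</last></name></beatle>
</beatles>"""


def positional_paths(xml_text):
    """Return one path per element, indexed where siblings share a name."""
    root = ET.fromstring(xml_text)
    out = []

    def walk(elem, path):
        out.append(path)
        totals = Counter(child.tag for child in elem)  # siblings per name
        seen = Counter()                               # position so far
        for child in elem:
            seen[child.tag] += 1
            step = "/" + child.tag
            if totals[child.tag] > 1:
                # Same test as "$vnumPrecSiblings or $vnumFollSiblings".
                step += "[%d]" % seen[child.tag]
            walk(child, path + step)

    walk(root, "/" + root.tag)
    return out


if __name__ == "__main__":
    print("\n".join(positional_paths(SAMPLE)))
```

Each element's children are counted once per parent, rather than once per path step as with the `preceding-sibling` and `following-sibling` axes.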
Dimitre Novatchev
  • Yup, nice one. (This is just an observation about SAX / XSLT, not your coding!) This is quite slow to render a largish XML file, but I guess it's not really the type of task XSLT was designed for. Thanks again! – monojohnny Apr 20 '11 at 22:41
  • @monojohny: You are welcome. Giving the most efficient solution was not part of your question. In fact, I am using almost the same code for a fine-grained XML-diff tool and it works very fast even with rather large XML documents -- completely sufficient for all my needs. Maybe your combination of hardware and XSLT processor leaves something to be desired... It is possible to implement a similar algorithm with only linear complexity -- in case you ask a new question I'd be glad to provide the answer. :) – Dimitre Novatchev Apr 20 '11 at 22:49
  • Again, just an observation: the output of this template seems to repeat itself. I get things like /xyz/abc[1], /xyz/abc[2]..., hence I think part of the slowness (big output file for Firefox to handle...) – monojohnny Apr 20 '11 at 22:52
  • @monojohny: See my answer to the answer of @andyb. In my practical work I often need different templates matching nodes having almost the same path, differing only in the specific position of a child element. This is not an artificial requirement but a typical one -- for example when producing a delimited list, where the last node is treated in a special way (no delimiter is output following the "regular" output). It is a good practice to avoid conditionals inside a template body and to make the condition(s) part of the match pattern. I strongly recommend following this principle. – Dimitre Novatchev Apr 20 '11 at 22:58
  • @Dimitre: I don't think it's the combination of h/w and XSLT processor. Other XSLTs render very quickly in my Firefox with the same input. I think it is due to the size of the _output_ tree produced (which _does_ slow down Firefox, I found). In any case, you have provided a very nice starting point for what I need to do. Thanks again. – monojohnny Apr 21 '11 at 13:00


I might be missing the point here but I understood the question to mean that you wanted unique named paths.

So from this XSL:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" exclude-result-prefixes="xsl">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:key name="nodeName" match="node()" use="name()"/>

<xsl:template match="//*[not(*)]"/>

<xsl:template match="/">
  <paths>
    <xsl:apply-templates select="//*[not(*)]"/>
  </paths>
</xsl:template>

<xsl:template match="node()[count(. | key('nodeName', name())[1]) = 1]" >
  <xsl:choose>
    <xsl:when test="not(child::*)">
      <path>
        <xsl:apply-templates select="parent::*"/>
        <xsl:value-of select="concat('/', name())"/>
      </path>
    </xsl:when>
    <xsl:otherwise>
      <xsl:apply-templates select="parent::*"/>
      <xsl:value-of select="concat('/', name())"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
</xsl:stylesheet>

I get the following output:

<paths>
  <path>/beatles/beatle/name/first</path>
  <path>/beatles/beatle/name/last</path>
</paths>
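
The output above (one entry per distinct leaf path) can be approximated in Python like this. Treat it as a sketch under assumptions rather than an equivalent of the stylesheet: the stylesheet keys on the element name alone, whereas this version deduplicates on the full path, and `unique_leaf_paths` and `SAMPLE` are names made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, shortened from the question's input.
SAMPLE = """<beatles>
  <beatle><name><first>John</first><last>Lennon</last></name></beatle>
  <beatle><name><first>Paul</first><last>McCartney</last></name></beatle>
</beatles>"""


def unique_leaf_paths(xml_text):
    """Return the distinct paths of elements that have no child elements."""
    root = ET.fromstring(xml_text)
    seen = {}  # dict keys act as an ordered set (Python 3.7+)

    def walk(elem, path):
        if len(elem) == 0:            # no child elements: a leaf
            seen.setdefault(path, None)
        for child in elem:
            walk(child, path + "/" + child.tag)

    walk(root, "/" + root.tag)
    return list(seen)


if __name__ == "__main__":
    print("\n".join(unique_leaf_paths(SAMPLE)))
```

Deduplicating on the full path means a leaf name that occurs under two different parents is reported once per parent.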
andyb
  • In real-world XSLT programming all paths, not only the names, are important. I often have two different templates, one matching all but the last `someElement` and one that matches the last `someElement`. – Dimitre Novatchev Apr 20 '11 at 14:32
  • Thanks for this: thinking about it, the result you generated here is good enough for my needs. My term 'unique' was a bit ambiguous, I guess. BTW: I tried this template on a 'real' file (150k+ lines of XML, if 'lines' have much meaning) in Firefox and it choked with "Error during XSLT transformation: XSLT Stylesheet (possibly) contains a recursion." Not a criticism, just reporting this for completeness! Cheers – monojohnny Apr 20 '11 at 22:45
  • I am not seeing that problem with my testing unfortunately. I guess we are using different parsers. I just verified the solution on http://www.freeformatter.com/xsl-transformer.html. The XSLT processor should understand recursion since it's a fundamental principle of the language. Maybe the `msxml.exe` parser has a bug? Which version are you using? It could be that my syntax was allowed but newer standards have made it invalid. – andyb Oct 31 '15 at 16:00