
This is not so much 'how do I do xxx' but 'how do I do xxx optimally?' (really hoping the challenge floats Dimitre's boat...)

All of the following is complicated by the restrictions of the XSL processor (msxsl: basically XSLT 1.0 plus the node-set(), replaces() and matches() extension functions).

I am generating some metadata from certain elements in a book - let's say chapters and div[title] elements (to simplify our data model quite a bit).

Page numbers in the book are given by processing instructions in mixed text nodes that might look like this:

<?Page pageId="256"?>

The page number that my element needs to be associated with is either the first descendant Page PI, when the page break is essentially the first piece of content within the element (for example, a chapter that starts on a new page), or else the first preceding::processing-instruction('Page').

Let's make up a sample document:

<?xml version="1.0" encoding="UTF-8"?>
<book>
    <chapter>
        <title><?Page pageId="1"?>Chapter I</title>
        <div>
            <p>Introduction to Chapter</p>
            <p>Second paragraph <?Page pageId="2"?>of introduction</p>
        </div>
        <div>
            <title>Section I</title>
            <p>A paragraph</p>
            <p>Another paragraph<?Page pageID="3"?></p>
        </div>
    </chapter>
    <chapter>
        <title><?Page pageId="4"?>Chapter II</title>
        <div>
            <p>Introduction to Chapter</p>
            <p>...</p>
        </div>
    </chapter>
</book>

(Note that although each chapter here starts on a new page, we can't guarantee that as a rule. There is a blank page at the end of chapter 1, which is something we see commonly.)

I want to extract information like this (I am fine with XSLT basics; the question is only about choosing the page numbers):

<meta>
    <meta>
        <field type="title">Chapter I</field>
        <field type="page">1</field>
        <meta>
            <field type="title">Section I</field>
            <field type="page">2</field>
        </meta>
    </meta>
    <meta>
        <field type="title">Chapter II</field>
        <field type="page">4</field>
    </meta>
</meta>

I can do various things using xsl:when statements and the descendant axis to decide which page number is appropriate, but I would much prefer to do something clever by matching on the processing instructions, because using the descendant axis on large books currently makes the transformation too slow to be usable. Keys would be nice, but things are further complicated because XSLT 1.0 allows neither variables nor other keys in the @use or @match attributes of xsl:key (and sequence constructors are not available either).
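To illustrate the kind of slow baseline meant here, a minimal sketch written against the sample document only (it is not the real stylesheet). It assumes that the page number is the first double-quoted value in the PI data, and that nested sections are always direct div[title] children of the current element; both are simplifications.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <!-- Wrap all chapter entries in a single outer meta element -->
  <xsl:template match="/book">
    <meta>
      <xsl:apply-templates select="chapter"/>
    </meta>
  </xsl:template>

  <xsl:template match="chapter | div[title]">
    <!-- Nearest Page PI before the first non-whitespace text node.
         Uses the descendant and preceding axes, which is what makes
         this approach slow on large books. -->
    <xsl:variable name="pi"
        select="descendant::text()[normalize-space()][1]
                /preceding::processing-instruction('Page')[1]"/>
    <meta>
      <field type="title"><xsl:value-of select="title"/></field>
      <field type="page">
        <!-- Assumption: the number is the first quoted value in the PI -->
        <xsl:value-of
            select="substring-before(substring-after($pi, '&quot;'), '&quot;')"/>
      </field>
      <!-- Simplification: only direct child sections are visited -->
      <xsl:apply-templates select="div[title]"/>
    </meta>
  </xsl:template>
</xsl:stylesheet>
```

Because a Page PI that opens the element is also on the preceding axis of the element's first text node, this one expression covers both cases of the rule, at the cost of walking two expensive axes for every element.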

Currently the elements I'm interested in finding page numbers for are defined in a key (real world data is much more complex) like the following:

<xsl:key name="auth" match="chapter|div[title]" use="generate-id()"/>

Any suggestions or pointers gratefully received!

Tom Hillman
  • So one way to identify the correct page number is to look at the first preceding Page PI from the first descendant non-whitespace text() node, but this involves two of the most inefficient axes in XPath and is slowing down the script massively: descendant::text()[normalize-space()!=''][1]/preceding::processing-instruction('Page')[1] – Tom Hillman Apr 13 '12 at 15:10

1 Answer


Here is a solution using keys, which may be efficient:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:key name="kPage"
   match="chapter/title/processing-instruction('Page')"
   use="generate-id(..)"/>

 <xsl:key name="kPage"
   match="processing-instruction('Page')"
   use="generate-id(following::div[title][1]/title)"/>

 <xsl:template match="*">
  <xsl:apply-templates select=
   "*[1]|following-sibling::*[1]"/>
 </xsl:template>

 <xsl:template match="chapter/title[1] | div/title[1]">
  <meta>
    <field type="title"><xsl:value-of select="."/></field>
    <field type="page">
      <xsl:variable name="vPiText"
           select="key('kPage', generate-id())[last()]"/>
      <xsl:value-of select=
      "translate($vPiText,
                 translate($vPiText, '0123456789', ''),
                 ''
                 )"/>
    </field>

    <xsl:apply-templates select="*[1]|following-sibling::*[1]"/>
  </meta>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document:

<book>
    <chapter>
        <title>
            <?Page pageId="1"?>Chapter I</title>
        <div>
            <p>Introduction to Chapter</p>
            <p>Second paragraph 
                <?Page pageId="2"?>of introduction</p>
        </div>
        <div>
            <title>Section I</title>
            <p>A paragraph</p>
            <p>Another paragraph
                <?Page pageID="3"?></p>
        </div>
    </chapter>
    <chapter>
        <title>
            <?Page pageId="4"?>Chapter II</title>
        <div>
            <p>Introduction to Chapter</p>
            <p>...</p>
        </div>
    </chapter>
</book>

the wanted, correct result is produced:

<meta>
   <field type="title">Chapter I</field>
   <field type="page">1</field>
   <meta>
      <field type="title">Section I</field>
      <field type="page">2</field>
   </meta>
</meta>
<meta>
   <field type="title">Chapter II</field>
   <field type="page">4</field>
</meta>
Dimitre Novatchev
  • Thanks for replying, Dimitre! Unfortunately I think that this solution depends a bit too strongly on the simplifications I've made in the example. It's a bit frustrating, but I'm obviously restricted on posting full book samples on here both because of size and IP concerns; plus it would be so complex that it would make for a pretty unfair question. – Tom Hillman Apr 16 '12 at 09:59
  • I'm thinking of another approach: creating a simplified node tree of 'meta' items with the relative positions of first text nodes and processing instructions, and then using the descendant and following axes on that... will post results. – Tom Hillman Apr 16 '12 at 10:03
  • @yamaxito -- Yes, I have thought about a two-pass approach. The first pass should turn the page PIs into elements, and maybe do some small restructuring, so that the page information will be easily available. Naturally, I cannot do more if the provided example isn't fully representative of the real source. – Dimitre Novatchev Apr 16 '12 at 12:02
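
A minimal sketch of what the first pass of such a two-pass approach might look like. It only covers turning the PIs into elements and does no restructuring. The element name page and the attribute name n are hypothetical, chosen purely for illustration, and the sketch assumes the page number is the first double-quoted value in the PI data.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity rule: copy every node unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Replace each Page PI with a (hypothetical) page element.
       This pattern has a higher default priority than node(),
       so it overrides the identity rule for these PIs. -->
  <xsl:template match="processing-instruction('Page')">
    <page n="{substring-before(substring-after(., '&quot;'), '&quot;')}"/>
  </xsl:template>
</xsl:stylesheet>
```

With the sample document, `<?Page pageId="2"?>` would become `<page n="2"/>` in place, so a second pass can address page breaks as ordinary elements. In a single stylesheet under msxsl, the intermediate result would presumably be held in a variable and passed through the node-set() extension mentioned in the question before the second pass is applied.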