
UPDATE: I think I have answered most of this question now, except for the handling of <pgBreak>. You can see my updates and current XSLT at the end of this post, under EDIT.

I asked a similar question yesterday, and received good answers. However, I have since realized this didn't cover all my bases so I am asking a more detailed question today.

XML IN

<?xml version="1.0" encoding="UTF-8"?>    
<root>
<pgBreak pgId="i"/>
    <p xml:id="a-01">
        <highlight rend="italic">Bacon ipsum dolor sit amet</highlight> bacon chuck pastrami swine pork rump, shoulder beef ribs doner tri-tip 
        tongue. Tri-tip ground round short ribs capicola meatloaf shank drumstick short loin pastrami t-
        bone. Sirloin turducken short ribs t-bone andouille strip steak pork loin corned beef hamburger 
        bacon filet mignon pork chop tail.
        <note.ref id="0001"><super>1</super></note.ref>
        <note id="0001">
            <p>
                You may need to consult a <highlight rend="italic">latin</highlight> butcher. Good Luck.
            </p>
        </note>   
        Pork loin <pgBreak pgId="01"/> ribeye bacon pastrami drumstick sirloin, shoulder pig jowl. Salami brisket rump ham, tail
        hamburger strip steak pig ham hock short ribs jerky shank beef spare ribs. Capicola short ribs swine   
        beef meatball jowl pork belly. Doner leberkas short ribs, flank chuck pancetta bresaola bacon ham 
        hock pork hamburger fatback.
    </p>
    <p xml:id="a-02">
        Bacon ipsum dolor sit amet bacon chuck pastrami swine pork rump, shoulder beef ribs doner tri-tip 
        tongue. Tri-tip ground round short ribs capicola meatloaf shank drumstick short loin pastrami t-
        bone. Sirloin turducken short ribs t-bone andouille strip steak pork loin corned beef hamburger 
        bacon filet mignon pork chop tail.
    </p>
    <p xml:id="a-03">
        Bacon ipsum dolor sit amet bacon chuck pastrami swine pork rump, shoulder beef ribs doner tri-tip 
        tongue. 
            <quote>
                <p> 1.
                    Tri-tip ground round short ribs capicola meatloaf shank drumstick short loin pastrami t-
                    bone. Sirloin turducken short ribs t-bone andouille strip steak pork loin corned beef hamburger 
                    bacon filet mignon pork chop tail.
                </p>
                <p> 2.
                    Tri-tip ground round short ribs capicola meatloaf shank drumstick short loin pastrami t-
                    bone. Sirloin <pgBreak pgId="02"/>turducken short ribs t-bone andouille strip steak pork loin corned beef hamburger 
                    bacon filet mignon pork chop tail.
                </p>
                <p> 3.
                    Tri-tip ground round short ribs capicola meatloaf shank drumstick short loin pastrami t-
                    bone. Sirloin turducken short ribs t-bone andouille strip steak pork loin corned beef hamburger 
                    bacon filet mignon pork chop tail.
                </p>
            </quote>
    </p>
</root>

HTML OUT

  <!DOCTYPE HTML>
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
      <title>Test</title>
   </head>
   <body>
      <div id="pg-i">
        Page i
      </div>
      <p data-chunkid="a-01"> 
         <span class="highlight-italic">Bacon ipsum dolor sit amet</span>bacon chuck pastrami swine pork rump, shoulder beef ribs doner tri-tip 
         tongue. Tri-tip ground round short ribs capicola meatloaf shank drumstick short loin
         pastrami t-
         bone. Sirloin turducken short ribs t-bone andouille strip steak pork loin corned beef
         hamburger 
         bacon filet mignon pork chop tail.
         <span class="noteRef" id="0001"><sup>1</sup></span></p>
      <div id="note-0001" data-chunkid="a-01">
         <p>
            You may need to consult a <span class="highlight-italic">latin</span> butcher. Good Luck.

         </p>
      </div>
      <p data-chunkid="a-01">   
         Pork loin
      </p>
      <div id="pg-01">
          Page 01
       </div>
        <p data-chunkId="a-01">
         ribeye bacon pastrami drumstick sirloin, shoulder pig jowl. Salami brisket
         rump ham, tail
         hamburger strip steak pig ham hock short ribs jerky shank beef spare ribs. Capicola
         short ribs swine   
         beef meatball jowl pork belly. Doner leberkas short ribs, flank chuck pancetta bresaola
         bacon ham 
         hock pork hamburger fatback.
       </p>
      <p data-chunkid="a-02"><span class="highlight-italic">Bacon ipsum dolor sit</span> amet bacon chuck pastrami swine pork rump, shoulder beef ribs doner tri-tip 
         tongue. Tri-tip ground round short ribs capicola meatloaf shank drumstick short loin
         pastrami t-
         bone. Sirloin turducken short ribs <span class="highlight-bold">t-bone</span> andouille strip steak pork loin corned beef hamburger 
         bacon filet mignon pork chop tail.

      </p>

      <p data-chunkid="a-03">
         Bacon ipsum dolor sit amet bacon chuck pastrami swine pork rump, shoulder beef ribs
         doner tri-tip 
         tongue. 

      </p>
      <blockquote data-chunkid="a-03">
        <p> 1.
            Tri-tip ground round short ribs capicola meatloaf shank drumstick short loin pastrami t-
            bone. Sirloin turducken short ribs t-bone andouille strip steak pork loin corned beef hamburger 
            bacon filet mignon pork chop tail.
        </p>
         <p>2.
               Tri-tip ground round <span class="highlight-italic">short ribs</span> capicola meatloaf shank drumstick short loin pastrami t-
               bone. Sirloin 
          </p>
       </blockquote>
       <div id="pg-02">
         Page 02
       </div>
       <blockquote data-chunkid="a-03"> 
         <p>
               turducken short ribs t-bone andouille strip steak pork loin corned beef
               hamburger bacon filet mignon pork chop tail.

         </p>
        <p> 3.
            Tri-tip ground round short ribs capicola meatloaf shank drumstick short loin pastrami t-
            bone. Sirloin turducken short ribs t-bone andouille strip steak pork loin corned beef hamburger 
            bacon filet mignon pork chop tail.
        </p>

      </blockquote>
      <p data-chunkid="a-03">
         Bacon ipsum dolor sit amet bacon chuck pastrami swine pork rump, shoulder beef ribs
         doner tri-tip 
         tongue. 

      </p>
   </body>
</html>

I would like to transform the XML to HTML5 but keep each chunk (xml:id) together. I want to avoid divitis (overuse of divs), so wrapping each p in a div is out, but I am also trying to avoid invalid HTML. For example, it would be easy to take the parent p (xml:id=a-01) and wrap it around its descendants. However, a block-level <div> or another <p> inside a <p> would be invalid, and the browser would interpret everything after the end of the first run of text as orphaned text.

I have tried various modified XSLTs from my question from yesterday. However, I find myself in somewhat unfamiliar territory. I would also benefit from a concise explanation of the solution so I can start to better understand XSLT, as it looks like I will be spending more time with it in the upcoming months. I should probably pick up a book by Michael Kay or something.

EDIT: current version of the XSLT I am working with

Note: I haven't attempted the page breaks yet. Also, I cannot get the <meta> tag to close; oXygen 14 keeps complaining about that.

<xsl:template match="/">
    <html>
        <body>
            <xsl:apply-templates/>
        </body>
    </html>
</xsl:template>

<xsl:template match="p[not((parent::note,.//p, .//div))]">
    <p data-chunkID="{@xml:id}">
        <xsl:apply-templates/>
    </p>
</xsl:template>

<xsl:template match="p[.//p, .//div]">
    <xsl:for-each-group select="node()" group-adjacent="boolean((self::text(), self::note.ref,self::highlight))">
        <xsl:choose>
            <xsl:when test="current-grouping-key()">
                <p data-chunkID="{../@xml:id}">
                    <xsl:apply-templates select="current-group()"/>
                </p>
            </xsl:when>
            <xsl:when test="self::p">
                <p>
                    <xsl:apply-templates/>
                </p>
            </xsl:when>
            <xsl:otherwise>
                <xsl:apply-templates select="current-group()"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:for-each-group>
</xsl:template>

<xsl:template match="note.ref">
    <span class="noteRef" id="{@id}">
        <xsl:apply-templates/>
    </span>
</xsl:template>

<xsl:template match="super">
    <sup>
        <xsl:apply-templates/>
    </sup>
</xsl:template>

<xsl:template match="note">
    <div id="note-{@id}" data-chunkID="{../@xml:id}">
        <p>
        <xsl:apply-templates/>
        </p>
    </div>
</xsl:template>


<xsl:template match="quote">
    <blockquote data-chunkID="{../@xml:id}">
        <p>
        <xsl:apply-templates/>
        </p>
    </blockquote>
</xsl:template>



<xsl:template match="highlight">
    <xsl:variable name="class" select="concat(name(.),'-',string(@rend))"/>
    <xsl:choose>
        <xsl:when test="@rend[.= 'italic']">
            <span class="{$class}">
                <xsl:apply-templates/>
            </span>
        </xsl:when>
        <xsl:when test="@rend[.= 'bold']">
            <span class="{$class}">
                <xsl:apply-templates/>
            </span>
        </xsl:when>
        <xsl:otherwise>
            <span class="{$class}">
                <xsl:apply-templates/>
            </span>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

matchew
  • matchew -- this question needs a lot of improvement -- I don't understand what is wanted and what are the rules that the transformation must implement. Please, do not refer to "my question from yesterday" -- there is no such question asked "yesterday" -- but provide all the data and explanations that are needed in order to understand what you are asking for. It may be helpful to give simpler/smaller XML document example. – Dimitre Novatchev Jan 16 '13 at 13:23
  • @DimitreNovatchev Thank you for trying to comprehend my question. 1. What is unclear? 2. I asked this on 12 dec after asking a question on 11 dec. So, it was 'yesterday', further the day is irrelevant when I link to the previous question. The question on 11 dec was a slimmed down version of my XML and did not cover all my cases. 3. This is as small as I can make the document. I have already trimmed it down extensively from what I am actually working with. – matchew Jan 16 '13 at 17:30
  • matchew, What is unclear are the rules for splitting -- I think it would be best to provide a specific example for every rule and to explain the rule. I cannot understand what actually is wanted, what is specifically required to obtain the provided wanted result and what rules led to what generated output. – Dimitre Novatchev Jan 16 '13 at 17:34
  • @matchew 3 hours left to the bounty, I answered 5 days ago. Any feedback? – JLRishe Jan 22 '13 at 18:50
  • @JLRishe Sorry, I haven't had the time to devote to this question and make sure this works the way I expected it to. I did run it quickly last week and it *seemed* to work, but I wanted to actually take time and parse what your solution was doing. Unfortunately, I won't have the time to do that at work today. However, I will award you the bounty. Thank you. – matchew Jan 22 '13 at 19:08
  • Thank you graciously @matchew. If you try it out and it doesn't work the way you hope, I'll be glad to continue looking for a solution. – JLRishe Jan 22 '13 at 19:11

1 Answer


It looks like your input is a little bit inconsistent with your output. (Is that the expected output, or the output you're getting now?) Chunks a-02 and a-03 have no <highlight> elements in the input, yet the output has <span class="highlight..."> elements. Also, chunk a-03 has its opening text duplicated after the blockquote.

I believe I've produced a working solution that does everything in your example. Could you give this a try?

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <html>
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <title>Test</title>
      </head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="p | div">
    <xsl:variable name="breaks" select="note | pgBreak | quote" />
    <xsl:variable name="firstNonBreak" select="node()[count(. | $breaks) != count($breaks)][1]" />
    <xsl:variable name="nonBreaksAfterBreak"
                  select="$breaks/following-sibling::node()[1][count(. | $breaks) != count($breaks)]" />

    <xsl:apply-templates select="$breaks | $firstNonBreak | $nonBreaksAfterBreak" mode="sectChild" />
  </xsl:template>

  <!-- Template to output the chunk id attribute of a particular hierarchy -->
  <xsl:template name="ChunkId">
    <xsl:variable name="id" select="ancestor::*[../self::root]/@xml:id" />
    <xsl:if test="$id">
      <xsl:attribute name="data-chunkid">
        <xsl:value-of select="$id"/>
      </xsl:attribute>
    </xsl:if>
  </xsl:template>

  <!-- Splitting types - notes, page breaks, quotes -->
  <xsl:template match="pgBreak" mode="sectChild">
    <div id="pg-{@pgId}">
      <xsl:value-of select="concat('Page ', @pgId)"/>
    </div>
  </xsl:template>

  <xsl:template match="quote | note" mode="sectChild">
    <xsl:apply-templates />
  </xsl:template>

  <!-- Receives the first node of each block of content outside of the splitting types
       and passes processing onto itself and siblings within its block-->
  <xsl:template match="text() | highlight | note.ref | super" mode="sectChild">

    <xsl:variable name="content">
      <xsl:apply-templates select="." mode="buildContent" />
    </xsl:variable>

    <xsl:if test="normalize-space($content)">
      <xsl:call-template name="Nest">
        <xsl:with-param name="hierarchy" select="ancestor::*[not(self::root)]" />
        <xsl:with-param name="content" select="$content" />
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <!-- Recursive template to output nodes from the top level down to content -->
  <xsl:template name="Nest">
    <xsl:param name="topLevel" select="true()"/>
    <xsl:param name="hierarchy" />
    <xsl:param name="content" />

    <xsl:variable name="top" select="$hierarchy[1]" />
    <xsl:variable name="remainder" select="$hierarchy[position() > 1]" />

    <!-- If there's a quote or note yet to come, don't output tags until we get there -->
    <xsl:variable name="skipTags" select="boolean($remainder[self::quote or self::note])" />
    <!-- Recursive output is captured in a variable, to be output later in this template -->
    <xsl:variable name="inside">
      <xsl:if test="$hierarchy">
        <xsl:call-template name="Nest">
          <xsl:with-param name="topLevel" select="$topLevel and $skipTags" />
          <xsl:with-param name="hierarchy" select="$remainder" />
          <xsl:with-param name="content" select="$content" />
        </xsl:call-template>
      </xsl:if>
    </xsl:variable>

    <xsl:choose>
      <xsl:when test="not($hierarchy)">
        <xsl:copy-of select="$content" />
      </xsl:when>
      <xsl:when test="$top/self::quote">
        <blockquote>
          <xsl:call-template name="ChunkId" />
          <xsl:copy-of select="$inside"/>
        </blockquote>
      </xsl:when>
      <xsl:when test="$top/self::note">
        <div id="note-{$top/@id}">
          <xsl:call-template name="ChunkId" />
          <xsl:copy-of select="$inside"/>
        </div>
      </xsl:when>
      <xsl:when test="not($skipTags)">
        <xsl:element name="{name($top)}">
          <xsl:if test="$topLevel">
            <xsl:call-template name="ChunkId" />
          </xsl:if>
          <xsl:copy-of select="$inside"/>
        </xsl:element>
      </xsl:when>
      <xsl:otherwise>
        <xsl:copy-of select="$inside"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="node()" mode="buildContent">
    <xsl:if test="not(self::note or self::quote or self::pgBreak)">
      <!-- output this node -->
      <xsl:apply-templates select="self::node()[normalize-space(.)]" mode="contentOutput" />
      <!-- pass processing onto next sibling -->
      <xsl:apply-templates select="following-sibling::node()[1]" mode="buildContent" />
    </xsl:if>
  </xsl:template>

  <!-- Bottom level content - text, note refs, superscript, highlight-->
  <xsl:template match="text()" mode="contentOutput">
    <xsl:copy-of select="."/>
  </xsl:template>

  <xsl:template match="note.ref" mode="contentOutput">
    <span class="noteRef" id="{@id}">
      <xsl:apply-templates mode="contentOutput"/>
    </span>
  </xsl:template>

  <xsl:template match="super" mode="contentOutput">
    <sup>
      <xsl:apply-templates mode="contentOutput"/>
    </sup>
  </xsl:template>

  <xsl:template match="highlight" mode="contentOutput">
    <xsl:variable name="class" select="concat(name(.),'-',string(@rend))"/>
    <span class="{$class}">
      <xsl:apply-templates mode="contentOutput"/>
    </span>
  </xsl:template>
</xsl:stylesheet>
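
For readers new to XSLT 1.0, the least obvious expression in the transform above is probably `count(. | $breaks) != count($breaks)`. It is a node-set membership test: taking the union of a node with a set only makes the set larger if the node was not already in it, so the expression is true exactly when the context node is not one of the `$breaks` nodes. The stylesheet below is a minimal sketch that isolates the idiom. The `$nonBreaks` variable name, the text output method, and the report line are illustration choices and not part of the transform above.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Only visit paragraphs that contain at least one splitting child -->
  <xsl:template match="/">
    <xsl:apply-templates select="//p[pgBreak | note | quote]"/>
  </xsl:template>

  <xsl:template match="p">
    <!-- The children that force the paragraph to be split -->
    <xsl:variable name="breaks" select="note | pgBreak | quote"/>
    <!-- Every other child: the union (. | $breaks) is larger than $breaks
         only when the context node is not already a member of $breaks -->
    <xsl:variable name="nonBreaks"
                  select="node()[count(. | $breaks) != count($breaks)]"/>
    <!-- One report line per paragraph -->
    <xsl:value-of select="concat('p: ', count($breaks), ' splitting children, ',
                                 count($nonBreaks), ' other child nodes&#10;')"/>
  </xsl:template>
</xsl:stylesheet>
```

In XSLT 2.0 the same test can be written more directly as `node() except $breaks`, which may be easier to read if the processor supports it.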

I believe the unclosed meta tag is a result of using method="html". You may need to use method="xml" to get a closed meta tag. With method="html", the above transform produces the following output from your sample input:

<html>
  <head>
    <META http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Test</title>
  </head>
  <body>
  <p data-chunkid="a-01"><span class="highlight-italic">Bacon ipsum dolor sit amet</span> bacon chuck pastrami swine pork rump, shoulder beef ribs doner tri-tip
    tongue. Tri-tip ground round short ribs capicola meatloaf shank drumstick short loin pastrami t-
    bone. Sirloin turducken short ribs t-bone andouille strip steak pork loin corned beef hamburger
    bacon filet mignon pork chop tail.
    <span class="noteRef" id="0001">
      <sup>1</sup>
    </span></p>
      <div id="note-0001" data-chunkid="a-01">
      <p>
        You may need to consult a <span class="highlight-italic">latin</span> butcher. Good Luck.
      </p>
    </div>
    <p data-chunkid="a-01">
    Pork loin </p>
    <div id="pg-01">Page 01</div>
    <p data-chunkid="a-01"> ribeye bacon pastrami drumstick sirloin, shoulder pig jowl. Salami brisket rump ham, tail
    hamburger strip steak pig ham hock short ribs jerky shank beef spare ribs. Capicola short ribs swine
    beef meatball jowl pork belly. Doner leberkas short ribs, flank chuck pancetta bresaola bacon ham
    hock pork hamburger fatback.
  </p>
  <p data-chunkid="a-02">
    Bacon ipsum dolor sit amet bacon chuck pastrami swine pork rump, shoulder beef ribs doner tri-tip
    tongue. Tri-tip ground round short ribs capicola meatloaf shank drumstick short loin pastrami t-
    bone. Sirloin turducken short ribs t-bone andouille strip steak pork loin corned beef hamburger
    bacon filet mignon pork chop tail.
  </p>
  <p data-chunkid="a-03">
    Bacon ipsum dolor sit amet bacon chuck pastrami swine pork rump, shoulder beef ribs doner tri-tip
    tongue.
    </p>
      <blockquote data-chunkid="a-03">
      <p>
        Tri-tip ground round short ribs capicola meatloaf shank drumstick short loin pastrami t-
        bone. Sirloin </p>
    </blockquote>
    <div id="pg-02">Page 02</div>
    <blockquote data-chunkid="a-03">
      <p>turducken short ribs t-bone andouille strip steak pork loin corned beef hamburger
        bacon filet mignon pork chop tail.
      </p>
    </blockquote>

</body>
</html>

By changing the method to "xml" and manually adding the meta element to the transform, you can obtain the same result, but with the following <head>:

  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Test</title>
  </head>
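
If the processor supports XSLT 2.0 (the `xsl:for-each-group` in the question suggests it does), a third output method may be worth trying: `method="xhtml"`, which is intended to serialize XHTML elements with XML-style closed empty tags. The following is a minimal sketch under that assumption, not a tested drop-in replacement. Putting the literal result elements in the XHTML namespace and setting `include-content-type="no"` (so the serializer does not add a second meta element) are assumptions made for this illustration.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://www.w3.org/1999/xhtml">
  <!-- xhtml method: XHTML-aware serialization, XSLT 2.0 and later only -->
  <xsl:output method="xhtml" indent="yes"
              omit-xml-declaration="yes"
              include-content-type="no"/>

  <xsl:template match="/">
    <html>
      <head>
        <!-- Written by hand because include-content-type is switched off -->
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
        <title>Test</title>
      </head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

Because the default namespace is declared on the stylesheet, every literal result element ends up in the XHTML namespace, so any CSS or script that selects elements by name should be checked against the actual output.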
JLRishe
  • Hey, I ran this quickly and it might just work. I will review this more when I am not working (tomorrow). But for now have an upvote. Thanks. – matchew Jan 18 '13 at 21:30
  • @matchew So, any feedback on this XSLT? Does it work for your requirements? Anything that needs touching up? – JLRishe Jan 21 '13 at 07:56
  • I know you answered this question the other month, but after feeding it some new documents with a similar structure I have come across a problem. If a <quote> has more than one paragraph (<p>), it outputs to several <blockquote>'s. Would you be interested in supplying a solution for this, or would you be kind enough to tell me that you are not interested so I can happily look elsewhere for help? Thought you might be interested. Thanks. – matchew Mar 18 '13 at 21:38
  • Sure, I will try to look into this soon. In the meantime, could you supply a sample input XML that demonstrates this issue? – JLRishe Mar 19 '13 at 02:44
  • Thank you so much! Your previous answer has really helped me a lot over the last few months. Notice my edit to the original question, namely the last section of the XML, where the quote has three paragraphs (it previously had one). I've gone ahead and given you some upvotes on other questions, as I cannot give you anything else for this question. – matchew Mar 19 '13 at 04:22
  • I've gone ahead and asked a new question. It really is the more appropriate way of going about this. Thank you for the input. It's been great. http://stackoverflow.com/questions/15505774/exit-and-reconstruct-element-s-at-pgbreak – matchew Mar 19 '13 at 17:00
  • Glad to help. I spent some time today trying to get it working better, but still grappling with it. Hopefully I can have it working in the next day or so. – JLRishe Mar 19 '13 at 17:07
  • Thanks. I did add a new element in my new question, and removed the need to track the chunks. – matchew Mar 19 '13 at 17:10