Extract text from Word-Document using XSLT

Question

I need to extract paragraphs (meaning: headlines together with their content) from a Word document using XSLT. I have analyzed the structure and can reach the necessary nodes in the .docx file with XSLT. What I do not know is how to group the content of the w:t elements between the headings, because Word splits the text into runs in a rather unpredictable way.

The input data looks like this:

<w:body xmlns:w="somenamespace">
   <w:p>
      <w:pPr> <w:pStyle w:val="Heading1" /> </w:pPr>
      <w:r> <w:t>My Headl</w:t> </w:r>
      <w:r> <w:t>ine</w:t> </w:r>
   </w:p>
   <w:p>
      <w:r> <w:t>text 1.1.1 </w:t> </w:r>
      <w:r> <w:t>text 1.1.2 </w:t> </w:r>
   </w:p>
   <w:p>
      <w:r> <w:t>text 1.2.1 </w:t> </w:r>
      <w:r> <w:t>text 1.2.2 </w:t> </w:r>
   </w:p>
   <w:p>
      <w:pPr> <w:pStyle w:val="Heading1" /> </w:pPr>
      <w:r> <w:t>My seco</w:t> </w:r>
      <w:r> <w:t>nd Headline</w:t> </w:r>
   </w:p>
   <w:p>
      <w:r> <w:t>text 2.1.1 </w:t> </w:r>
      <w:r> <w:t>text 2.1.2 </w:t> </w:r>
   </w:p>
   <w:p>
      <w:r> <w:t>text 2.2.1 </w:t> </w:r>
      <w:r> <w:t>text 2.2.2 </w:t> </w:r>
   </w:p>
</w:body>

Concatenating the content of a single paragraph is no problem, so it is simple to merge the data into a compact structure like the following:

<Document>
    <Paragraphs>
        <Headline>My Headline</Headline>
        <Content>text 1.1.1 text 1.1.2 </Content>
        <Content>text 1.2.1 text 1.2.2 </Content>
        <Headline>My second Headline</Headline>
        <Content>text 2.1.1 text 2.1.2 </Content>
        <Content>text 2.2.1 text 2.2.2 </Content>
    </Paragraphs>
</Document>
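
For reference, this flat step can be sketched roughly as follows. This is only a minimal sketch, assuming an XSLT 2.0 (or later) processor and the placeholder namespace from the sample above; the `separator=""` attribute is what joins the split runs without inserting spaces.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="somenamespace"
    exclude-result-prefixes="w">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

  <!-- Wrap all paragraphs in the flat Document/Paragraphs structure -->
  <xsl:template match="w:body">
    <Document>
      <Paragraphs>
        <xsl:apply-templates select="w:p"/>
      </Paragraphs>
    </Document>
  </xsl:template>

  <!-- A paragraph styled as Heading1 becomes a Headline;
       the predicate gives this template priority over the plain w:p one -->
  <xsl:template match="w:p[w:pPr/w:pStyle/@w:val = 'Heading1']">
    <Headline>
      <xsl:value-of select="w:r/w:t" separator=""/>
    </Headline>
  </xsl:template>

  <!-- Every other paragraph becomes a Content element -->
  <xsl:template match="w:p">
    <Content>
      <xsl:value-of select="w:r/w:t" separator=""/>
    </Content>
  </xsl:template>
</xsl:stylesheet>
```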

But this structure is not always useful, because it still does not provide a single XML element for the whole content under one headline. Does anyone know how to merge all paragraphs that lie between the w:p elements representing headlines? I would like an XSLT that transforms the w:body content into a structure like this:

<Document>
    <Paragraph>
        <Headline>My Headline</Headline>
        <Content>text 1.1.1 text 1.1.2 text 1.2.1 text 1.2.2 </Content>
    </Paragraph>
    <Paragraph>
        <Headline>My second Headline</Headline>
        <Content>text 2.1.1 text 2.1.2 text 2.2.1 text 2.2.2 </Content>
    </Paragraph>
</Document>

What I have found so far:

If a w:p element contains a w:pPr element, it is always the first child node of that w:p element.
If a w:p element matches the condition ./w:pPr/w:pStyle[@w:val='Heading1'], all w:r elements in that w:p element belong to the headline of the paragraph.

score 1 · Accepted Answer · answered Dec 17 '19 at 10:24

This might solve your problem. Use the xsl:for-each-group instruction, which is available in XSLT 2.0 and later. Select all w:p elements and declare that each group starts with a w:p in which the heading style is defined. Inside the group, the current-group() function returns the whole sequence of nodes belonging to that group: the first item is the headline paragraph and the remaining items are its content.

XSLT:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:w="somenamespace">
  <xsl:output method="xml" omit-xml-declaration="yes" />

  <xsl:template match="w:body">
    <Document>
      <!-- Each group starts at a w:p with the Heading1 style and runs up to the next such w:p -->
      <xsl:for-each-group select="w:p" group-starting-with="*[./w:pPr/w:pStyle[@w:val='Heading1']]">
        <xsl:element name="Paragraph">
          <xsl:element name="Headline">
            <!-- separator="" joins the runs directly; the default separator would put a space between them -->
            <xsl:value-of select="current-group()[1]/*/w:t/text()" separator="" />
          </xsl:element>
          <xsl:element name="Content">
            <!-- All remaining paragraphs of the group make up the content -->
            <xsl:for-each select="current-group()[position()>1]/*">
              <xsl:copy-of select="./w:t/text()" />
            </xsl:for-each>
          </xsl:element>
        </xsl:element>
      </xsl:for-each-group>
    </Document>
  </xsl:template>

  <!-- Suppress everything outside w:body -->
  <xsl:template match="*|node()">
    <xsl:apply-templates />
  </xsl:template>
</xsl:stylesheet>

Output:

<Document xmlns:w="somenamespace">
  <Paragraph>
    <Headline>My Headline</Headline>
    <Content>text 1.1.1 text 1.1.2 text 1.2.1 text 1.2.2 </Content>
  </Paragraph>
  <Paragraph>
    <Headline>My second Headline</Headline>
    <Content>text 2.1.1 text 2.1.2 text 2.2.1 text 2.2.2 </Content>
  </Paragraph>
</Document>
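
Note that xsl:for-each-group requires an XSLT 2.0 or later processor. If only XSLT 1.0 is available, the same grouping can probably be approximated with a key that indexes every non-heading paragraph by its nearest preceding heading. The following is a sketch under that assumption, not a tested drop-in replacement; the key name `body-by-heading` is made up for illustration, and the namespace is the placeholder from the question.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="somenamespace"
    exclude-result-prefixes="w">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

  <!-- Index each non-heading paragraph by the generated id of the
       nearest preceding Heading1 paragraph (hypothetical key name) -->
  <xsl:key name="body-by-heading"
           match="w:p[not(w:pPr/w:pStyle/@w:val = 'Heading1')]"
           use="generate-id(preceding-sibling::w:p[w:pPr/w:pStyle/@w:val = 'Heading1'][1])"/>

  <!-- Only heading paragraphs are processed directly -->
  <xsl:template match="w:body">
    <Document>
      <xsl:apply-templates select="w:p[w:pPr/w:pStyle/@w:val = 'Heading1']"/>
    </Document>
  </xsl:template>

  <!-- One Paragraph element per heading -->
  <xsl:template match="w:p">
    <Paragraph>
      <Headline>
        <!-- XSLT 1.0 value-of only outputs the first node, so loop over the runs -->
        <xsl:for-each select="w:r/w:t">
          <xsl:value-of select="."/>
        </xsl:for-each>
      </Headline>
      <Content>
        <!-- Look up all paragraphs that belong to this heading -->
        <xsl:for-each select="key('body-by-heading', generate-id())/w:r/w:t">
          <xsl:value-of select="."/>
        </xsl:for-each>
      </Content>
    </Paragraph>
  </xsl:template>
</xsl:stylesheet>
```

Paragraphs that appear before the first heading are indexed under an empty key value and are therefore dropped by this sketch.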
