I need to extract paragraphs (that is, headlines together with their content) from a Word document using XSLT. I have analyzed the structure and can reach the necessary nodes in the .docx file with XSLT, but I do not know how to group the content of the w:t elements between the headings, because Word splits the text into runs in a seemingly arbitrary way.
The input data looks like this:
<w:body xmlns:w="somenamespace">
<w:p>
<w:pPr> <w:pStyle w:val="Heading1" /> </w:pPr>
<w:r> <w:t>My Headl</w:t> </w:r>
<w:r> <w:t>ine</w:t> </w:r>
</w:p>
<w:p>
<w:r> <w:t>text 1.1.1 </w:t> </w:r>
<w:r> <w:t>text 1.1.2 </w:t> </w:r>
</w:p>
<w:p>
<w:r> <w:t>text 1.2.1 </w:t> </w:r>
<w:r> <w:t>text 1.2.2 </w:t> </w:r>
</w:p>
<w:p>
<w:pPr> <w:pStyle w:val="Heading1" /> </w:pPr>
<w:r> <w:t>My seco</w:t> </w:r>
<w:r> <w:t>nd Headline</w:t> </w:r>
</w:p>
<w:p>
<w:r> <w:t>text 2.1.1 </w:t> </w:r>
<w:r> <w:t>text 2.1.2 </w:t> </w:r>
</w:p>
<w:p>
<w:r> <w:t>text 2.2.1 </w:t> </w:r>
<w:r> <w:t>text 2.2.2 </w:t> </w:r>
</w:p>
</w:body>
Concatenating the w:t content within a single w:p is no problem, so it is simple to merge the data into a compact structure like the following:
<Document>
<Paragraphs>
<Headline>My Headline</Headline>
<Content>text 1.1.1 text 1.1.2 </Content>
<Content>text 1.2.1 text 1.2.2 </Content>
<Headline>My second Headline</Headline>
<Content>text 2.1.1 text 2.1.2 </Content>
<Content>text 2.2.1 text 2.2.2 </Content>
</Paragraphs>
</Document>
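For reference, a flat structure like the one above can be produced along the following lines. This is only a minimal sketch, assuming XSLT 1.0 and the placeholder namespace URI `somenamespace` from the sample input; a real .docx file uses `http://schemas.openxmlformats.org/wordprocessingml/2006/main`, and there w:body sits inside w:document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="somenamespace"
    exclude-result-prefixes="w">
  <!-- "somenamespace" is the placeholder URI from the sample input (assumption) -->

  <xsl:output method="xml" indent="yes"/>

  <!-- Wrap all paragraphs of the body in Document/Paragraphs -->
  <xsl:template match="w:body">
    <Document>
      <Paragraphs>
        <xsl:apply-templates select="w:p"/>
      </Paragraphs>
    </Document>
  </xsl:template>

  <!-- A w:p styled as Heading1 becomes a Headline -->
  <xsl:template match="w:p[w:pPr/w:pStyle/@w:val = 'Heading1']">
    <Headline>
      <xsl:apply-templates select="w:r/w:t"/>
    </Headline>
  </xsl:template>

  <!-- Every other w:p becomes a Content element -->
  <xsl:template match="w:p">
    <Content>
      <xsl:apply-templates select="w:r/w:t"/>
    </Content>
  </xsl:template>

  <!-- Emit the text of each run fragment; adjacent fragments are joined as-is -->
  <xsl:template match="w:t">
    <xsl:value-of select="."/>
  </xsl:template>
</xsl:stylesheet>
```

Selecting only `w:r/w:t` skips the whitespace-only text nodes between the elements, so the run fragments are joined without stray spaces.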
But this structure is not always useful, because it still does not have a single XML element holding all the content that belongs to one headline.
Does anyone know how to merge all paragraphs that lie between the w:p elements representing headlines?
I would like an XSLT stylesheet that transforms the w:body content into a structure like this:
<Document>
<Paragraph>
<Headline>My Headline</Headline>
<Content>text 1.1.1 text 1.1.2 text 1.2.1 text 1.2.2 </Content>
</Paragraph>
<Paragraph>
<Headline>My second Headline</Headline>
<Content>text 2.1.1 text 2.1.2 text 2.2.1 text 2.2.2 </Content>
</Paragraph>
</Document>
What I have found so far:
- If a w:p element contains a w:pPr element, then it is always the first child node of that w:p element.
- If a w:p element matches the condition ./w:pPr/w:pStyle[@w:val='Heading1'], then all w:r elements in that w:p element belong to the headline of the paragraph.
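The two observations above suggest one possible direction: index every non-heading w:p by its nearest preceding heading w:p, then emit one Paragraph per heading. The following is only a sketch under assumptions, not a verified solution for real Word output. It assumes XSLT 1.0, the placeholder namespace URI `somenamespace` from the sample input, and that only the Heading1 style marks a headline. The key name `content-by-heading` is made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="somenamespace"
    exclude-result-prefixes="w">
  <!-- "somenamespace" is the placeholder URI from the sample input (assumption) -->

  <xsl:output method="xml" indent="yes"/>

  <!-- Index each non-heading w:p by the generated id of the nearest
       preceding sibling w:p that carries the Heading1 style.
       "content-by-heading" is a hypothetical name. -->
  <xsl:key name="content-by-heading"
           match="w:p[not(w:pPr/w:pStyle/@w:val = 'Heading1')]"
           use="generate-id(preceding-sibling::w:p[w:pPr/w:pStyle/@w:val = 'Heading1'][1])"/>

  <!-- Only heading paragraphs are processed directly -->
  <xsl:template match="w:body">
    <Document>
      <xsl:apply-templates select="w:p[w:pPr/w:pStyle/@w:val = 'Heading1']"/>
    </Document>
  </xsl:template>

  <!-- One Paragraph per heading: the heading's own runs form the Headline,
       the runs of all paragraphs keyed to this heading form the Content -->
  <xsl:template match="w:p">
    <Paragraph>
      <Headline>
        <xsl:for-each select="w:r/w:t">
          <xsl:value-of select="."/>
        </xsl:for-each>
      </Headline>
      <Content>
        <xsl:for-each select="key('content-by-heading', generate-id())/w:r/w:t">
          <xsl:value-of select="."/>
        </xsl:for-each>
      </Content>
    </Paragraph>
  </xsl:template>
</xsl:stylesheet>
```

In this sketch, paragraphs that appear before the first heading are keyed to an empty id and are therefore dropped, and paragraphs with other heading styles are treated as ordinary content. With an XSLT 2.0 processor, `xsl:for-each-group` with `group-starting-with` would likely express the same grouping more directly.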