1

For a Paragraph object, how can I determine which page it is on using the Open XML SDK 2.5?

I've collected all the child elements in my document, and fetched their inner text as well, using this:

    foreach (var i in mainPart.Document.ChildElements.FirstOrDefault().ChildElements)
    {
        ParagraphElements.Add(i); // list of OpenXmlElement
    }

I want to get the page number for the corresponding paragraph. For example, I have "this is heading 1" styled as Heading 1, and it will be added to the TOC, so I need to supply its page number there.

Thanks in advance

Pavithran R

3 Answers

7

Pages do not exist in the OpenXML format until they are rendered by a word processor.

The metadata necessary to calculate which page a given paragraph should appear on is available, but doing so is far from a straightforward operation.

To verify that page numbers do not exist in the raw OpenXML markup:

  1. Rename a copy of your Word document ending with ".docx" to end with ".zip".
  2. Within this zip archive, open the sub-directory named "word".
  3. Within "word" open "document.xml".

This file contains the XML content behind your mainPart.Document call. The "document.xml" file has a single node, <document>...</document>, which in turn has a single child node, <body>...</body>, which holds the content you're interested in.

When working with OpenXML documents, I find that the abstractions in the OpenXML SDK can sometimes be distracting. Thankfully, it's simple to explore the raw markup with LINQ-to-XML. For example, your call to:

var childrenFromOpenXmlSdk = mainPart.Document.ChildElements.Single().ChildElements;

is equivalent to the following in LINQ-to-XML:

IEnumerable<XElement> childrenFromLinqToXml = 
    XElement.Load("[path]/[file]/word/document.xml")
            .Elements()
            .Single()
            .Elements();

Inspecting the elements in the childrenFromLinqToXml you'll find no page number information.
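If you'd rather not rename and unzip the file by hand, the same inspection can be sketched with `ZipArchive` plus LINQ-to-XML. This is only a minimal sketch: the class and method names are made up for illustration, and it assumes the main part lives at "word/document.xml" (which is where Word writes it, although the package relationships are what formally decide that).

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the Open XML SDK.
public static class DocxMarkup
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the direct children of <w:body> from a .docx stream.
    public static List<XElement> LoadBodyChildren(Stream docxStream)
    {
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the main document part is stored at this path.
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found.");

            using (Stream xml = entry.Open())
            {
                XElement body = XElement.Load(xml).Element(W + "body");
                if (body == null)
                    throw new InvalidDataException("No <w:body> element found.");
                return body.Elements().ToList();
            }
        }
    }
}
```

Calling `DocxMarkup.LoadBodyChildren(File.OpenRead(path))` and dumping each element's `Name.LocalName` should show paragraphs, tables and section properties, but nothing that states a page number.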

You may see cached page numbers in the raw markup of the TOC itself, but these will be artifacts of the previous rendering, defined by content tags or form fields.

If you need to build up the TOC programmatically, have a look at the following sites:

  1. OfficeOpenXML.com's reference article for TOCs

    • This is a helpful reference for the ECMA-376 specification of OpenXML.
  2. Eric White's screencast "Exploring Tables-of-Contents in Open XML WordprocessingML Documents"

    • Eric White is a leading authority on all things OpenXML. His ericwhite.com/blog is well-worth a look when you find yourself at the intersections of XML markup and on-screen rendering.

--- Following up on the Sai's comments ---

Hi Austin Drenski, I've created the TOC and added all the headings programmatically. All I need is the page numbers. Is there any alternative way to get the page number of a particular paragraph? I've gone through all the screencasts, but I'm looking for the page number alone.

<w:r>
    <w:fldChar w:fldCharType="begin" />
</w:r>
<w:r>
    <w:instrText xml:space="preserve"> PAGEREF _Toc481680509 \h </w:instrText>
</w:r>
<w:r>
    <w:fldChar w:fldCharType="separate" />
</w:r>
<w:r>
    <w:t>2</w:t>
</w:r>
<w:r>
    <w:fldChar w:fldCharType="end" />
</w:r>

In that sample XML, "2" acts as the page number. That is hardcoded.

My TOC now works perfectly without page numbers. I also analysed the default MS Word behaviour: the first time, page numbers are written literally, as above.

You can programmatically place a content control <w:sdt> in the document, as a child of the <w:body> element.

For a simple TOC with two entries:

<w:sdt>
    <w:sdtPr>
        <w:id w:val="429708664"/>
        <w:docPartObj>
            <w:docPartGallery w:val="Table of Contents"/>
            <w:docPartUnique/>
        </w:docPartObj>
    </w:sdtPr>
    <w:sdtContent>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="TOCHeading"/>
            </w:pPr>
            <w:r>
                <w:t>Contents</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="TOC1"/>
                <w:tabs>
                    <w:tab w:val="right" w:leader="dot" w:pos="9350"/>
                </w:tabs>
            </w:pPr>
            <w:r>
                <w:fldChar w:fldCharType="begin"/>
            </w:r>
            <w:r>
                <w:instrText xml:space="preserve"> TOC \o "1-3" \h \z \u </w:instrText>
            </w:r>
            <w:r>
                <w:fldChar w:fldCharType="separate"/>
            </w:r>
            <w:hyperlink w:anchor="_Toc481654079" w:history="1">
                <w:r>
                    <w:rPr>
                        <w:rStyle w:val="Hyperlink"/>
                    </w:rPr>
                    <w:t>Testing 1</w:t>
                </w:r>
                <w:r>
                    <w:tab/>
                </w:r>
                <w:r>
                    <w:fldChar w:fldCharType="begin"/>
                </w:r>
                <w:r>
                    <w:instrText xml:space="preserve"> PAGEREF _Toc481654079 \h </w:instrText>
                </w:r>
                <w:r>
                </w:r>
                <w:r>
                    <w:fldChar w:fldCharType="separate"/>
                </w:r>
                <w:r>
                    <w:t>0</w:t>
                </w:r>
                <w:r>
                    <w:fldChar w:fldCharType="end"/>
                </w:r>
            </w:hyperlink>
        </w:p>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="TOC1"/>
                <w:tabs>
                    <w:tab w:val="right" w:leader="dot" w:pos="9350"/>
                </w:tabs>
            </w:pPr>
            <w:hyperlink w:anchor="_Toc481654080" w:history="1">
                <w:r>
                    <w:rPr>
                        <w:rStyle w:val="Hyperlink"/>                                
                    </w:rPr>
                    <w:t>Testing 2</w:t>
                </w:r>
                <w:r>
                    <w:tab/>
                </w:r>
                <w:r>
                    <w:fldChar w:fldCharType="begin"/>
                </w:r>
                <w:r>
                    <w:instrText xml:space="preserve"> PAGEREF _Toc481654080 \h </w:instrText>
                </w:r>
                <w:r>
                    <w:fldChar w:fldCharType="separate"/>
                </w:r>
                <w:r>
                    <w:t>0</w:t>
                </w:r>
                <w:r>
                    <w:fldChar w:fldCharType="end"/>
                </w:r>
            </w:hyperlink>
        </w:p>
        <w:p>
            <w:r>
                <w:fldChar w:fldCharType="end"/>
            </w:r>
        </w:p>
    </w:sdtContent>
</w:sdt>

Note the use of PAGEREF field codes pointing at bookmarks. Also note the subsequent markup <w:t>0</w:t>. When the document is opened and the field codes are updated, this zero will be replaced by the page number on which the bookmark is currently rendered.
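As a rough illustration of that field structure, here is a sketch that builds the five runs of one PAGEREF field with LINQ-to-XML. The class name, method name and the default "0" placeholder are assumptions for the example, not SDK API; the bookmark name is whatever your heading's bookmark is called.

```csharp
using System.Xml.Linq;

// Hypothetical helper, not part of the Open XML SDK.
public static class PageRefBuilder
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds: begin, instruction, separate, cached result, end.
    public static XElement[] Build(string bookmarkName, string placeholder = "0")
    {
        return new[]
        {
            Run(FldChar("begin")),
            Run(new XElement(W + "instrText",
                new XAttribute(XNamespace.Xml + "space", "preserve"),
                " PAGEREF " + bookmarkName + " \\h ")),
            Run(FldChar("separate")),
            // Placeholder only; the word processor overwrites it on field update.
            Run(new XElement(W + "t", placeholder)),
            Run(FldChar("end")),
        };
    }

    static XElement Run(XElement child)
    {
        return new XElement(W + "r", child);
    }

    static XElement FldChar(string type)
    {
        return new XElement(W + "fldChar", new XAttribute(W + "fldCharType", type));
    }
}
```

The returned runs would go inside the `<w:hyperlink>` of a TOC entry, after the entry text and the tab run.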

Each time the document is paginated, the exact placement of a bookmark could change.

Once the zeros are replaced with actual page numbers, you will see those numbers in the markup. However, they are simply the last rendered values for those field codes.
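If those cached values are good enough for your purposes, they can be read back out of the markup. The following is a sketch under stated assumptions: the class and method names are invented, it only understands simple PAGEREF fields built from `fldChar` runs as shown above, and whatever it returns is only as fresh as the last time a word processor updated the fields.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper, not part of the Open XML SDK.
public static class PageRefCache
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Maps bookmark name -> last cached PAGEREF result found under root.
    public static Dictionary<string, string> Read(XElement root)
    {
        var cached = new Dictionary<string, string>();
        var instruction = new StringBuilder();
        var result = new StringBuilder();
        string bookmark = null;
        bool inResult = false;

        foreach (XElement e in root.Descendants()) // document order
        {
            if (e.Name == W + "fldChar")
            {
                string type = (string)e.Attribute(W + "fldCharType");
                if (type == "begin")
                {
                    // A nested field (PAGEREF inside TOC) restarts the state.
                    instruction.Clear();
                    result.Clear();
                    bookmark = null;
                    inResult = false;
                }
                else if (type == "separate")
                {
                    string[] tokens = instruction.ToString()
                        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                    if (tokens.Length >= 2 && tokens[0] == "PAGEREF")
                        bookmark = tokens[1];
                    inResult = true;
                }
                else if (type == "end")
                {
                    if (bookmark != null)
                        cached[bookmark] = result.ToString();
                    bookmark = null;
                    inResult = false;
                }
            }
            else if (e.Name == W + "instrText" && !inResult)
            {
                instruction.Append(e.Value);
            }
            else if (e.Name == W + "t" && inResult && bookmark != null)
            {
                result.Append(e.Value);
            }
        }
        return cached;
    }
}
```

For a freshly generated document this simply returns the placeholders you wrote yourself, which is exactly why the values cannot be trusted until the document has been rendered.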

In the document settings, you can prompt the user to update field codes upon opening, so that the TOC numbers will accurately reflect the current on-screen rendering. To do so, your settings file should resemble:

<w:settings ...namespaces omitted...>
    <w:updateFields w:val="true"/>
    ...other settings omitted...
</w:settings>

In the end, you still need to render the OpenXML document with a word processor, but you avoid the complexity of calculating page positions.

Austin Drenski
  • Hi Austin Drenski, I've created the TOC and added all the headings programmatically. All I need is the page numbers. Is there any alternative way to get the page number of a particular paragraph? I've gone through all the screencasts, but I'm looking for the page number alone. – Pavithran R May 04 '17 at 13:17
  • PAGEREF _Toc481680509 \h 2 – Pavithran R May 04 '17 at 13:18
  • In that sample XML, "2" acts as the page number. That is hardcoded. – Pavithran R May 04 '17 at 13:22
  • My TOC now works perfectly without page numbers. I also analysed the default MS Word behaviour: the first time, page numbers are written literally, as above. – Pavithran R May 04 '17 at 13:38
  • I've edited my answer to show how the TOC content control + `PAGEREF` fields are structured prior to the word processor rendering the document. – Austin Drenski May 04 '17 at 14:09
  • Thanks for the prompt response @Austin. But if we update fields on open, a pop-up appears the first time the document is opened, saying that the content has changed and asking whether to update. End users won't appreciate this. I even tried to suppress the pop-up, but that causes other issues. I'm looking for the page number specifically, so I can map it directly to the corresponding TOC entries :) – Pavithran R May 04 '17 at 15:40
0

After a lot of groundwork, I found that the page number cannot be retrieved from an OpenXML element. We can approximate it, but we cannot be sure, because page numbers are produced by the word processor's layout engine after all the OpenXML elements have been passed to it. We can estimate the page from LastRenderedPageBreak elements, but we cannot be sure that the resulting location is correct.

So I would suggest going with UpdateFieldsOnOpen or a macro as an easier solution.
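For reference, the LastRenderedPageBreak approximation can be sketched roughly like this. It is only an estimate under several assumptions: the helper name is made up, it looks only at paragraphs that are direct children of `<w:body>` (not those inside tables or content controls), it counts only `<w:lastRenderedPageBreak/>` markers, and those markers exist only if a word processor has saved the document at least once.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the Open XML SDK.
public static class PageEstimate
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Estimated starting page (1-based) for each top-level paragraph.
    public static List<int> EstimateStartPages(XElement body)
    {
        var pages = new List<int>();
        int breaksSoFar = 0;

        foreach (XElement p in body.Elements(W + "p"))
        {
            // If the first content of the first run is a rendered page break,
            // the paragraph itself starts on the new page.
            XElement firstRun = p.Elements(W + "r").FirstOrDefault();
            XElement firstContent = firstRun == null
                ? null
                : firstRun.Elements().FirstOrDefault(e => e.Name != W + "rPr");
            bool startsNewPage = firstContent != null
                && firstContent.Name == W + "lastRenderedPageBreak";

            pages.Add(1 + breaksSoFar + (startsNewPage ? 1 : 0));

            // Breaks inside this paragraph push later paragraphs further on.
            breaksSoFar += p.Descendants(W + "lastRenderedPageBreak").Count();
        }
        return pages;
    }
}
```

Any edit made after the last save, or a different printer or font setup, can move the real page boundaries, so treat the output as a hint rather than a fact.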

Pavithran R
0

Get the current page number

            runs.Append(new FieldChar
            {
                FieldCharType = FieldCharValues.Begin,
            });
            runs.Append(new FieldCode(@" PAGE \* MERGEFORMAT ")
            {
                Space = SpaceProcessingModeValues.Preserve
            });
            runs.Append(new FieldChar
            {
                FieldCharType = FieldCharValues.Separate
            });
            runs.Append(new FieldChar
            {
                FieldCharType = FieldCharValues.End
            });

Get the total number of pages

            runs.Append(new FieldChar
            {
                FieldCharType = FieldCharValues.Begin,
            });
            runs.Append(new FieldCode(@" NUMPAGES \* MERGEFORMAT ")
            {
                Space = SpaceProcessingModeValues.Preserve
            });
            runs.Append(new FieldChar
            {
                FieldCharType = FieldCharValues.Separate
            });
            runs.Append(new FieldChar
            {
                FieldCharType = FieldCharValues.End
            });

Complete code

            // Assumes `part` is a FooterPart that already belongs to the document.
            Paragraph paragraph = new Paragraph();
            Footer footer1 = new Footer();

            // Footer text renders as "第 X 页/共 Y 页", i.e. "Page X of Y".
            Run runs = new Run();
            runs.Append(new Text("第"));

            // PAGE field: the current page number.
            runs.Append(new FieldChar
            {
                FieldCharType = FieldCharValues.Begin,
            });
            runs.Append(new FieldCode(@" PAGE \* MERGEFORMAT ")
            {
                Space = SpaceProcessingModeValues.Preserve
            });
            runs.Append(new FieldChar
            {
                FieldCharType = FieldCharValues.Separate
            });
            runs.Append(new FieldChar
            {
                FieldCharType = FieldCharValues.End
            });

            runs.Append(new Text("页/共"));

            // NUMPAGES field: the total number of pages.
            runs.Append(new FieldChar
            {
                FieldCharType = FieldCharValues.Begin,
            });
            runs.Append(new FieldCode(@" NUMPAGES \* MERGEFORMAT ")
            {
                Space = SpaceProcessingModeValues.Preserve
            });
            runs.Append(new FieldChar
            {
                FieldCharType = FieldCharValues.Separate
            });
            runs.Append(new FieldChar
            {
                FieldCharType = FieldCharValues.End
            });

            runs.Append(new Text("页"));
            paragraph.Append(runs);

            footer1.Append(paragraph);

            part.Footer = footer1;

Result as shown in Word: [image]