Is there a library with built-in methods to get the exact page number and line number of an element in a Word document?

Question

I am looking for a Java library that can read all data from a Word document element by element (paragraphs, tables, charts, images, comments, etc.) and that keeps track of the page number and line number where each element starts, using built-in methods. I tried Apache POI, Aspose and docx4j, but I didn't find any such method to get the line number. If there is one, please let me know.

score 2 · Answer 1 · answered Feb 23 '23 at 12:03

There is a reason you did not find methods to get line numbers or the division of a Word document into pages: this is not possible without rendering the document.

A Word document body consists of a stream of body elements, such as paragraphs, tables, pictures, diagrams, and so on. Technically there is no page concept in a Word document. Microsoft Word creates pages on the fly while rendering. The same goes for text lines: Microsoft Word recomputes line breaks on the fly while rendering.

One body element follows the element before it. How much height a body element claims depends on multiple factors, such as font family, font size, margins, paddings, vertical indents, special height settings, and so on. Therefore, without rendering, one cannot know how much vertical space a body element needs or whether a page is full.

The same goes for text lines. How much width the content of a body element claims depends on multiple factors, such as font family, font size, margins, paddings, horizontal indents, special width settings, and so on. Therefore, without rendering, one cannot know how much horizontal space the content needs or whether the right page margin has been reached, making a line break necessary.
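To illustrate why a line number is a result of layout rather than something stored in the file, here is a toy sketch. It assumes a greedy word wrap in which every character is exactly one unit wide, which is far simpler than what a real rendering engine does; the class `ToyLineWrap` and its methods are made up for this illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

public class ToyLineWrap {

    /** Greedy wrap. Assumption: each character takes exactly one unit of width. */
    static List<String> wrap(String text, int lineWidth) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // Start a new line if the next word (plus a space) would not fit.
            if (current.length() > 0 && current.length() + 1 + word.length() > lineWidth) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    /** Returns the 1-based line containing the first occurrence of the word, or -1. */
    static int lineOf(String word, String text, int lineWidth) {
        List<String> lines = wrap(text, lineWidth);
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains(word)) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "Microsoft Word reformats line breaks on the fly while rendering";
        // Same text, same word, different available width: different line number.
        System.out.println("Width 40: line " + lineOf("rendering", text, 40));
        System.out.println("Width 20: line " + lineOf("rendering", text, 20));
    }
}
```

The same word lands on a different line as soon as the available width changes. A real engine additionally has to account for proportional fonts, kerning, hyphenation, indents and so on, which is why a library without a layout engine cannot report line numbers.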

There may be explicit page breaks and explicit line breaks set in the document. However, they are optional, so one cannot rely on them: there may be automatic line breaks and automatic page breaks too, depending on the rendering.

Microsoft Word normally marks the page breaks it computed (as w:lastRenderedPageBreak elements in the XML) when a document is rendered and then saved. However, this is not something I would rely on either for a *.docx, as the file could have been created using other word-processing applications, which don’t do so.
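For what it is worth, those markers can be inspected without any Word library, because a *.docx is a ZIP archive whose main part is word/document.xml. The following is only a rough sketch using the JDK alone: it assumes the conventional `w:` namespace prefix that Word writes, counts `<w:lastRenderedPageBreak/>` markers and explicit `<w:br w:type="page"/>` breaks with regular expressions rather than a real XML parser, and the class name `PageBreakMarkers` is made up for this example. The counts are hints left by the last application that saved the file, not a reliable page count.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class PageBreakMarkers {

    // Assumption: the "w:" prefix is used for the WordprocessingML namespace.
    static final Pattern RENDERED = Pattern.compile("<w:lastRenderedPageBreak\\s*/>");
    static final Pattern EXPLICIT = Pattern.compile("<w:br\\s[^>]*w:type=\"page\"[^>]*/>");

    /** Reads word/document.xml from a .docx stream (a ZIP archive) as a string. */
    static String readDocumentXml(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // readAllBytes stops at the end of the current ZIP entry.
                    return new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }
        throw new IOException("word/document.xml not found in archive");
    }

    /** Counts non-overlapping matches of the pattern in the XML text. */
    static int count(Pattern pattern, String xml) {
        Matcher matcher = pattern.matcher(xml);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            String xml = readDocumentXml(in);
            System.out.println("Rendered page break markers: " + count(RENDERED, xml));
            System.out.println("Explicit page breaks: " + count(EXPLICIT, xml));
        }
    }
}
```

If the file was never saved by an application that writes these markers, the first count is simply 0, which says nothing about how many pages the document has when rendered.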

Apache POI does not render Word documents internally, so there is no chance of doing this with that library.

Aspose.Words, at least, should have an internal rendering engine and so should be able to determine at least the pages. At least, it claims so. However, I do not know much about the Aspose code.

score 1 · Accepted Answer · answered Feb 23 '23 at 12:05

MS Word documents are flow documents and do not contain any information about the document layout. Consumer applications like MS Word or Open Office build the document layout on the fly. Aspose.Words has its own document layout engine. The facade classes LayoutCollector and LayoutEnumerator allow you to get layout information for document elements. For example, to determine the page index where a particular node starts or ends, you can use LayoutCollector.getStartPageIndex and LayoutCollector.getEndPageIndex respectively:

import com.aspose.words.Document;
import com.aspose.words.LayoutCollector;
import com.aspose.words.Paragraph;

// Load the document and attach a layout collector to it.
Document doc = new Document("C:\\Temp\\in.docx");
LayoutCollector collector = new LayoutCollector(doc);
// Get some node, for example the first paragraph in the document.
Paragraph para = doc.getFirstSection().getBody().getFirstParagraph();
System.out.println("Paragraph starts on page " + collector.getStartPageIndex(para) + ".");

See the Aspose.Words for Java GitHub repository for more code examples.

Thanks for the response. How can we extract the x-axis and y-axis values from a chart? I am able to get the chart title and the series values. Are there any methods to get the x-axis data? – lemon chow, Feb 23 '23 at 12:16
