OpenDocument format: parse & split text by lines

Question

I'm parsing (using Groovy) the content.xml obtained from a LibreOffice .odt (Writer) file.

I want to make sure I hoover up all the text in the file, splitting by line breaks.

In Java's org.w3c.dom.Node (or Groovy's groovy.util.Node) there is a method to pick up all the text under any node (dom.Node.getTextContent/util.Node.text). For the highest node this will print all the text in the file, but it ignores line breaks.

This led me to suppose I would instead have to walk (depth-first) through the structure, identifying individual lines.

Parsing through such a structure I find that the "local part" of the nodes' names which tend to have text are "p" (paragraph) and "h" (heading).

I'm also assuming that a "p" or "h" can't nest another "p" or "h" (although with some complicated embedded structure I'm sure they can...). But clearly examining any spans under a given "p" will generate text which you've already obtained from its ancestor "p" node.

But are "p" and "h" the only QNames that I need to look at? And how should I deal with the possibility of embedded structures (e.g. a graphic containing some text)?

Is there some technique whereby I can get a comprehensive listing of all text, node by node, ensuring that no text is missed out and none duplicated?

Failing this, is there some aspect of the OpenDocument format which might let me work this out? Interestingly, the example in the brief overview at Wikipedia, under "content.xml", uses just these two QNames, "p" and "h".

Have you considered using the API provided by Apache to read the files rather than trying to invent your own? https://incubator.apache.org/odftoolkit/simple/index.html This class looks interesting: https://incubator.apache.org/odftoolkit/0.6.2-incubating/simple/org/odftoolkit/simple/common/TextExtractor.html Never tried it, but it might save you some time wrangling XML – tim_yates Feb 09 '18 at 20:24

mike rodent · Accepted Answer · 2018-02-24T20:24:32.030


Tim Yates' comment seems the best way to go.

Unless anyone objects, I shall not delete this question, though, because there doesn't seem to be another one like it.

From first experiments it appears that org.odftoolkit.simple.TextDocument.getParagraphIterator() will iterate through all paras, including "h" QNames (= headings), and also including empty paragraphs. A good sign.

NB bear in mind that these "paragraphs" may in fact be multi-line paragraphs: in a Writer file there is a difference between a "paragraph mark" and a "newline". The solution to this is very simple, however: just split the String returned by Paragraph.getTextContent() (the textContent property, for Groovy people) on the newline character.
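
For comparison, if pulling in the toolkit isn't an option, the hand-rolled depth-first walk described in the question could look roughly like the sketch below. It uses only the JDK's DOM classes, not the ODF Toolkit. The class name OdtLines and the method lines are made up for illustration. It assumes content.xml as LibreOffice writes it (no pretty-printing whitespace inside paragraphs), treats every "p" and "h" in the ODF text namespace as its own paragraph, and turns text:line-break into a newline before splitting. A "p" nested inside another "p" (e.g. in a text box) becomes a separate entry, so its text is not also counted in the outer one. It makes no attempt to skip things like tracked deletions, so treat it as a starting point only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of any library.
final class OdtLines {
    // Namespace URI bound to the "text:" prefix in ODF documents.
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // Returns every line of text in the given content.xml string, in document order.
    static List<String> lines(String contentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for getLocalName()/getNamespaceURI()
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(contentXml)));

        List<StringBuilder> paragraphs = new ArrayList<>();
        walk(doc.getDocumentElement(), null, paragraphs);

        List<String> result = new ArrayList<>();
        for (StringBuilder paragraph : paragraphs) {
            // limit -1 keeps empty lines, including empty paragraphs
            for (String line : paragraph.toString().split("\n", -1)) {
                result.add(line);
            }
        }
        return result;
    }

    // Depth-first walk; "current" is the buffer of the nearest enclosing p/h, or null.
    private static void walk(Node node, StringBuilder current, List<StringBuilder> paragraphs) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            if (current != null) current.append(node.getNodeValue());
            return;
        }
        if (type != Node.ELEMENT_NODE) return;

        if (TEXT_NS.equals(node.getNamespaceURI())) {
            switch (node.getLocalName()) {
                case "p":
                case "h":
                    // Start a new paragraph; a nested p/h gets its own buffer,
                    // so its text is never added to the outer paragraph too.
                    current = new StringBuilder();
                    paragraphs.add(current);
                    break;
                case "line-break":
                    if (current != null) current.append('\n');
                    return;
                case "tab":
                    if (current != null) current.append('\t');
                    return;
                case "s": {
                    // text:s stands for one or more spaces (count in text:c)
                    String c = ((Element) node).getAttributeNS(TEXT_NS, "c");
                    int count = c.isEmpty() ? 1 : Integer.parseInt(c);
                    for (int i = 0; current != null && i < count; i++) current.append(' ');
                    return;
                }
                default:
                    break; // spans etc.: just descend into the children
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, current, paragraphs);
        }
    }
}
```

Calling OdtLines.lines(xml) with the text of content.xml gives one String per line. Text that sits outside any "p" or "h" (normally just whitespace) is dropped.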

edited Feb 24 '18 at 20:24

answered Feb 09 '18 at 20:56

mike rodent


Yay! Fingers crossed :-D – tim_yates Feb 09 '18 at 20:56
Are you sure you don't want to make your comment into an answer? Then I'd delete mine... – mike rodent Feb 09 '18 at 20:57
No, mine was more of a passing hint – tim_yates Feb 09 '18 at 21:38
