I have a Word document with sample text from the New York Times, like below:

Sample Text:

KABUL, Afghanistan – Secretary of State John Kerry began a series of meetings in Kabul on Friday in hopes of finding a way out of a presidential election crisis that has threatened to split the Afghan government and prompted Western officials to warn that Afghanistan risked losing billions of dollars in air on which it depends.

When I upload the document to the docx4java online demo, I get the following PartsList:

<w:p w14:paraId="4CB9CEA6" w14:textId="77777777">
    <w:r>
        <w:t>Sample Text:</w:t>
    </w:r>
</w:p>
<w:p w14:paraId="0F399D69" w14:textId="77777777"/>
<w:p w14:paraId="0C68A7DC" w14:textId="6C93B9E5">
    <w:r>
        <w:t>KABUL, Af</w:t>
    </w:r>
    <w:r>
        <w:t>g</w:t>
    </w:r>
    <w:r>
        <w:t xml:space="preserve">hanistan – Secretary of State John Kerry began a series of meetings in Kabul on Friday in hopes of finding a way out of a presidential election crisis that has threatened to split the Afghan government and prompted Western officials to warn that Afghanistan risked losing billions of dollars in air on which it depends. </w:t>
    </w:r>
    <w:bookmarkStart w:name="_GoBack" w:id="0"/>
    <w:bookmarkEnd w:id="0"/>
</w:p>

Note how the word Afghanistan is split across three separate runs (<w:r> elements). I'm not sure why that happens.

I am extracting text from this docx using docx4j with the code below:

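import java.util.List;
import org.docx4j.TraversalUtil;

StringBuilder builder = new StringBuilder();

class DocumentTraverser extends TraversalUtil.CallbackImpl {
    @Override
    public List<Object> apply(Object o) {
        // Append the content of every text element (<w:t>) as it is visited.
        if (o instanceof org.docx4j.wml.Text) {
            builder.append(((org.docx4j.wml.Text) o).getValue());
        }
        return null;
    }
}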

Using this code, builder has the following content:

Sample Text:KABUL, Afghanistan – Secretary of State John Kerry began a series of meetings in Kabul on Friday in hopes of finding a way out of a presidential election crisis that has threatened to split the Afghan government and prompted Western officials to warn that Afghanistan risked losing billions of dollars in air on which it depends.

However, this text is not what the docx contains as-is: "Sample Text:KABUL" should not be one word.

Question

Is there a way to extract text from the DOCX as-is, meaning that all the words are separated just the way they are in the original document?

Anthony

1 Answer

You should account for paragraph breaks (</w:p>). As I do not have docx4j on my machine, the following is more of an idea:

@Override
public List<Object> apply(Object o) {
    if (o instanceof org.docx4j.wml.Text) {
        builder.append(((org.docx4j.wml.Text) o).getValue());
    } else if (o instanceof org.docx4j.wml.P) {
        // docx4j passes JAXB objects to the callback, so a <w:p>
        // arrives as org.docx4j.wml.P rather than as a DOM Element.
        builder.append("\n");
    }
    return null;
}

This adds a linefeed at the beginning of each paragraph; you may want to refine that, for example by adding it at the end instead.

By the way, make sure you match <w:t> only, as there are other text-bearing elements as well, such as field instruction text (<w:instrText>).

Also page breaks ("\f") may be added on <w:lastRenderedPageBreak>.
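To make the three points above concrete without depending on docx4j, here is a minimal sketch that walks the raw WordprocessingML with the JDK's DOM parser. It is not docx4j code: the class name DocxTextSketch, the method names, and the SAMPLE string (a cut-down body shaped like the XML in the question, plus one extra made-up paragraph) are all assumptions for illustration. It appends the text of <w:t> only, emits "\n" when a <w:p> closes, and emits "\f" on <w:lastRenderedPageBreak>.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DocxTextSketch {
    // Namespace bound to the "w" prefix in WordprocessingML documents.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample: shortened from the question, plus one invented
    // paragraph containing a rendered page break and field instruction text.
    static final String SAMPLE =
        "<w:body xmlns:w=\"" + W_NS + "\">"
        + "<w:p><w:r><w:t>Sample Text:</w:t></w:r></w:p>"
        + "<w:p/>"
        + "<w:p><w:r><w:t>KABUL, Af</w:t></w:r><w:r><w:t>g</w:t></w:r>"
        + "<w:r><w:t xml:space=\"preserve\">hanistan risked </w:t></w:r></w:p>"
        + "<w:p><w:r><w:lastRenderedPageBreak/>"
        + "<w:instrText>PAGE</w:instrText><w:t>Next page</w:t></w:r></w:p>"
        + "</w:body>";

    static String extract(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getLocalName()
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Depth-first walk in document order.
    static void walk(Node node, StringBuilder out) {
        boolean wml = node instanceof Element && W_NS.equals(node.getNamespaceURI());
        String name = wml ? node.getLocalName() : "";
        if (name.equals("t")) {
            out.append(node.getTextContent()); // runs are joined with no separator
            return;
        }
        if (name.equals("lastRenderedPageBreak")) {
            out.append('\f');
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, out);
        }
        if (name.equals("p")) {
            out.append('\n'); // paragraph break once the paragraph is complete
        }
    }

    public static void main(String[] args) {
        System.out.print(extract(SAMPLE));
    }
}
```

Because runs inside one paragraph are concatenated without a separator, "Af", "g" and "hanistan" come back as a single word, while "Sample Text:" and "KABUL" end up on different lines. The <w:instrText> content is skipped because only elements named t contribute text.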

Joop Eggen