I have a Word document with sample text from The New York Times, like below:
Sample Text:
KABUL, Afghanistan – Secretary of State John Kerry began a series of meetings in Kabul on Friday in hopes of finding a way out of a presidential election crisis that has threatened to split the Afghan government and prompted Western officials to warn that Afghanistan risked losing billions of dollars in air on which it depends.
When I upload the document to the docx4java online demo, I get the following PartsList:
<w:p w14:paraId="4CB9CEA6" w14:textId="77777777">
<w:r>
<w:t>Sample Text:</w:t>
</w:r>
</w:p>
<w:p w14:paraId="0F399D69" w14:textId="77777777"/>
<w:p w14:paraId="0C68A7DC" w14:textId="6C93B9E5">
<w:r>
<w:t>KABUL, Af</w:t>
</w:r>
<w:r>
<w:t>g</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve">hanistan – Secretary of State John Kerry began a series of meetings in Kabul on Friday in hopes of finding a way out of a presidential election crisis that has threatened to split the Afghan government and prompted Western officials to warn that Afghanistan risked losing billions of dollars in air on which it depends. </w:t>
</w:r>
<w:bookmarkStart w:name="_GoBack" w:id="0"/>
<w:bookmarkEnd w:id="0"/>
</w:p>
Note how the word Afghanistan is broken into three different <w:t> tags. I'm not sure why that happens.
I am extracting text from this docx using docx4j with the code below:
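import java.util.List;
import org.docx4j.TraversalUtil;
import org.docx4j.wml.Text;

StringBuilder builder = new StringBuilder();

class DocumentTraverser extends TraversalUtil.CallbackImpl {
    @Override
    public List<Object> apply(Object o) {
        // Append the value of every text node, in document order
        if (o instanceof Text) {
            builder.append(((Text) o).getValue());
        }
        return null;
    }
}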
Using this code, builder has the following content:
Sample Text:KABUL, Afghanistan – Secretary of State John Kerry began a series of meetings in Kabul on Friday in hopes of finding a way out of a presidential election crisis that has threatened to split the Afghan government and prompted Western officials to warn that Afghanistan risked losing billions of dollars in air on which it depends.
However, this text isn't what the docx contains as-is: Sample Text:KABUL should not be one word.
Question
Is there a way to extract text from the DOCX as-is, meaning that all the words are separated just the way they are in the original document?
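To make the desired behaviour concrete, here is a minimal sketch of it. It does not use docx4j at all: it parses a shortened copy of the XML above with the JDK's DOM parser, joins the <w:t> values inside each <w:p> without a separator (so the split runs still form one word), and emits a line break at the end of every paragraph. The class name ParagraphText and the method extract are hypothetical names chosen for this illustration, and the sample XML omits the w14 attributes and bookmarks.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of docx4j.
class ParagraphText {
    // WordprocessingML main namespace (the "w:" prefix)
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of each <w:p>, one paragraph per line.
    // Assumption: paragraphs are not nested (no text boxes), otherwise
    // nested text would be emitted more than once.
    static String extract(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        StringBuilder out = new StringBuilder();
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            // Runs inside one paragraph are joined with no separator
            NodeList texts = p.getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                out.append(texts.item(j).getTextContent());
            }
            // The paragraph boundary becomes a line break
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:body xmlns:w=\"" + W_NS + "\">"
                + "<w:p><w:r><w:t>Sample Text:</w:t></w:r></w:p>"
                + "<w:p/>"
                + "<w:p><w:r><w:t>KABUL, Af</w:t></w:r>"
                + "<w:r><w:t>g</w:t></w:r>"
                + "<w:r><w:t>hanistan</w:t></w:r></w:p>"
                + "</w:body>";
        System.out.print(extract(xml));
    }
}
```

With this sample, "Sample Text:" and "KABUL, Afghanistan" end up on separate lines with an empty line between them, which is the separation I'm after. What I don't know is how to get the same paragraph boundaries out of the docx4j traversal.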