I use the XOM library to parse and process .docx documents. MS Word stores text content in runs (<w:r>) inside paragraph tags (<w:p>), and often splits the text across several runs. Sometimes every word and every space between words is in a separate run. When I load a run that contains only a space, the parser removes the space and treats the element as an empty tag. As a result, the output contains the text without spaces. How can I force the parser to keep all the spaces? I would prefer to keep this parser, but if there is no solution, could you recommend an alternative?
This is how I call the parser:
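import nu.xom.Builder;
import nu.xom.Nodes;
import nux.xom.xquery.StreamingPathFilter;
import nux.xom.xquery.StreamingTransform;

// Stream each child element of <w:body> through contentTransform as it is parsed
StreamingPathFilter filter = new StreamingPathFilter("/w:document/w:body/*:*", prefixes);
Builder builder = new Builder(filter.createNodeFactory(null, contentTransform));
builder.build(documentFile);
...
StreamingTransform contentTransform = new StreamingTransform() {
    @Override
    public Nodes transform(nu.xom.Element node) {
        <...process XML and output text...>
    }
};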
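To make the run structure and the expected output concrete, here is a minimal sketch. It uses the JDK's built-in DOM parser instead of XOM, purely so that it is self-contained, and it does not reproduce the space-stripping behaviour described above. The sample XML and the names `RunTextDemo`, `SAMPLE`, and `extractText` are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RunTextDemo {
    // WordprocessingML main namespace, bound to the "w" prefix in .docx files.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample: one paragraph split into three runs;
    // the middle run holds nothing but a single space.
    static final String SAMPLE =
        "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
        + "<w:r><w:t>Hello</w:t></w:r>"
        + "<w:r><w:t xml:space=\"preserve\"> </w:t></w:r>"
        + "<w:r><w:t>world</w:t></w:r>"
        + "</w:p></w:body></w:document>";

    // Concatenates the text of every <w:t> element in document order.
    static String extractText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < texts.getLength(); i++) {
                out.append(texts.item(i).getTextContent());
            }
            return out.toString();
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers stay simple.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Brackets make leading or trailing whitespace visible.
        System.out.println("[" + extractText(SAMPLE) + "]"); // prints [Hello world]
    }
}
```

The output I expect from the XOM-based code is the same "Hello world", with the whitespace-only run kept, rather than "Helloworld".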