docx4j Differencer Showing More Differences Than Expected

Question

I have two documents:

Document 2 is the result of passing Document 1 through a transformation process which leaves any content and formatting intact (verified by side-by-side compare in Word).

However, the process removes many id numbers from the .docx files.

For example,

      <w:p w:rsidP="00B600D6" w:rsidR="00F55D78" w:rsidRDefault="00B600D6">

becomes

      <w:p>

according to a dump of each document via the following code:

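import java.io.OutputStreamWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.docx4j.wml.Body;
import org.docx4j.wml.Document;
import org.w3c.dom.Node;

// Marshal the main document body to a W3C DOM node.
Body body = ((Document) newerPackage.getMainDocumentPart().getJaxbElement()).getBody();
Node node = org.docx4j.XmlUtils.marshaltoW3CDomDocument(body).getDocumentElement();
// Pretty-print the node to standard output as UTF-8 XML.
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(new DOMSource(node),
        new StreamResult(new OutputStreamWriter(System.out, "UTF-8")));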

Using the docx4j Differencer comparison method recommended here, everything (except the first line, which has no formatting applied) is shown as a modification.

The question is: are the diffs a result of the missing ids, the formatting, or something else?

In case it's important, we're using docx4j in this context to perform automated sanity/regression tests on our round-tripping process (i.e. apply the "loss-less" process and expect no differences).

Are you getting this message from Google Docs? "Sorry, we are unable to generate a view of the document at this time. Please try again later. You can also try to download the original document by clicking here." If so, you can use the linked "here" text to get the docx. I double-checked the sharing and it seems ok. — Jacob Zwiers, Jul 11 '12 at 17:56

score 0 · Accepted Answer · answered Jul 11 '12 at 12:48

Disclosure: I work on docx4j

If the only difference between paragraphs is the rsid attributes, they will still be detected as different.

You could "clean" the documents before performing the comparison, so that neither docx has rsid attributes. See the Filter sample.
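As a rough illustration of that cleaning step, here is a minimal sketch that uses only the JDK's DOM API and is not the docx4j Filter sample itself. It assumes that every attribute in the WordprocessingML namespace whose local name starts with "rsid" (rsidR, rsidP, rsidRDefault, rsidRPr and so on) can safely be dropped before comparing. The class and method names (RsidCleaner, stripRsids, clean) are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j.
class RsidCleaner {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Recursively remove w:rsid* attributes from a node and its descendants.
    static void stripRsids(Node node) {
        NamedNodeMap attrs = node.getAttributes();
        if (attrs != null) {
            // Walk backwards: the map is live and shrinks on removal.
            for (int i = attrs.getLength() - 1; i >= 0; i--) {
                Attr a = (Attr) attrs.item(i);
                String local = a.getLocalName();
                if (W_NS.equals(a.getNamespaceURI())
                        && local != null && local.startsWith("rsid")) {
                    ((Element) node).removeAttributeNode(a);
                }
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            stripRsids(children.item(i));
        }
    }

    // Convenience wrapper: parse an XML string, strip rsids, serialize again.
    static String clean(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed for namespace-based matching
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        stripRsids(doc.getDocumentElement());
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

With docx4j, the DOM node obtained via XmlUtils.marshaltoW3CDomDocument(body), as in the question's dump code, could be passed to stripRsids for both documents before they are compared.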

By the way, an easier way to see the XML for an object (e.g. a single paragraph, or the entire body) is to use XmlUtils.marshaltoString.

JasonPlutext

Cleaning the documents gave an accurate comparison. Thanks for that, Jason... and the tip on `XmlUtils.marshaltoString()` :-) – Jacob Zwiers Jul 11 '12 at 21:49
