8

How can I read a docx file word by word, together with the style of each word? I want to compare two docx files word by word and, based on the differences, write the result into another docx file (using C# and OOXML). I have tried DocumentFormat.OpenXml.Extensions.dll, OpenXMLdiff.dll and ICSharpCode.SharpZipLib.dll, but none of them lets me read word by word. (ICSharpCode.SharpZipLib does give me the words, but not the style associated with each word.)

Any help on this will be very useful.

Todd Main
user274223

2 Answers

3

This MSDN article shows how to reliably retrieve the exact text of a document, paragraph by paragraph.

http://msdn.microsoft.com/en-us/library/ff686712.aspx

At the same time, you can determine the style for each paragraph. That is pretty easy. The following blog post shows how to retrieve the style and text for each paragraph:

http://blogs.msdn.com/b/ericwhite/archive/2009/02/16/finding-paragraphs-by-style-name-or-content-in-an-open-xml-word-processing-document.aspx
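As a rough illustration of retrieving the style and text of each paragraph, here is a minimal sketch. It is written in Python with only the standard library so that it stays self-contained, and it is not the code from the linked posts. It assumes the main text lives in the `word/document.xml` part, that a paragraph without an explicit `w:pStyle` can be reported as `Normal` (the real default is defined in the document's styles part), and it ignores tracked changes, tabs and line breaks. The function name `paragraphs_with_styles` is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraphs_with_styles(docx_file, default_style="Normal"):
    """Return a list of (style_id, text) tuples, one per w:p element.

    docx_file may be a path or a binary file-like object.
    default_style is an assumption used when a paragraph names no style.
    """
    # A .docx file is a ZIP archive; the body text is in word/document.xml.
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    result = []
    for para in root.iter(f"{{{W}}}p"):
        # The style ID (not the display name) is stored in w:pPr/w:pStyle/@w:val.
        style = para.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else default_style
        # Paragraph text is the concatenation of all w:t elements in its runs.
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        result.append((style_id, text))
    return result
```

Calling `paragraphs_with_styles("report.docx")` (a hypothetical file name) returns pairs such as `("Heading1", "Overview of the Process")`.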

Comparing the two? That depends on the exact semantics you want. One approach is to create, for each document, an XML document that contains its paragraphs and their styles, and then compare those XML documents. Such an XML document might look something like this:

<Root>
  <Para>
    <Style>Normal</Style>
    <Text>This is the text of the paragraph.</Text>
  </Para>
  <Para>
    <Style>Heading1</Style>
    <Text>Overview of the Process</Text>
  </Para>
</Root>
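One possible way to build and compare such summaries is sketched below, again in Python with only the standard library. The helper names (`build_summary`, `summary_pairs`, `compare_summaries`) are hypothetical, the input is assumed to be a list of (style, text) pairs already extracted from each document, and the comparison is a plain paragraph-level sequence diff, which is only one of several reasonable semantics.

```python
import difflib
import xml.etree.ElementTree as ET


def build_summary(paragraphs):
    """Build the <Root>/<Para>/<Style>/<Text> summary from (style, text) pairs."""
    root = ET.Element("Root")
    for style, text in paragraphs:
        para = ET.SubElement(root, "Para")
        ET.SubElement(para, "Style").text = style
        ET.SubElement(para, "Text").text = text
    return root


def summary_pairs(root):
    """Read a summary back into a list of (style, text) tuples."""
    return [(p.findtext("Style", ""), p.findtext("Text", ""))
            for p in root.findall("Para")]


def compare_summaries(old_root, new_root):
    """Return (operation, old_paragraphs, new_paragraphs) for each changed block.

    operation is one of "replace", "delete" or "insert", as reported by difflib.
    """
    old, new = summary_pairs(old_root), summary_pairs(new_root)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [(tag, old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]
```

Because a paragraph is compared as a (style, text) pair, a paragraph whose text is unchanged but whose style differs still shows up as a replacement.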
Eric White
0

The easiest way is to unzip the DOCX file (it is an ordinary ZIP archive) using your favorite ZIP library and then compare the extracted files, which are XML parts such as word/document.xml, using a file IO library.
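A minimal sketch of that idea, taken one step further to compare word by word with formatting attached, might look like the following. It uses Python's standard library only, and every function name in it is hypothetical. It assumes the text of interest is in `word/document.xml`, describes each word only by the direct formatting of the run that contains it (paragraph and inherited styles are not resolved), and treats a word that Word has split across two runs as two separate words.

```python
import difflib
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def run_format(run):
    """Summarise a run's direct formatting (children of w:rPr) as a sorted tuple."""
    props = run.find("w:rPr", NS)
    if props is None:
        return ()
    labels = []
    for child in props:
        name = child.tag.rpartition("}")[2]      # local name, e.g. "b" or "rStyle"
        value = child.get(f"{{{W}}}val")         # e.g. "Emphasis" for w:rStyle
        labels.append(name if value is None else f"{name}={value}")
    return tuple(sorted(labels))


def styled_words(docx_file):
    """Return [(word, formatting), ...] in document order."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    words = []
    for run in root.iter(f"{{{W}}}r"):
        fmt = run_format(run)
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        # Split on whitespace; each word carries its run's formatting.
        words.extend((word, fmt) for word in text.split())
    return words


def diff_words(old_docx, new_docx):
    """Return (operation, old_words, new_words) for each changed block."""
    old, new = styled_words(old_docx), styled_words(new_docx)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [(tag, old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]
```

With this representation, making a word bold in the second document is reported as a replacement of that one word, even though its text is the same. Writing the differences back into a third DOCX file is a separate step that this sketch does not cover.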

Zian Choy