I am looking for the easiest way to check whether two Office files have the same content using Python. My first instinct was to use filecmp.cmp, but this fails: two files with the same content do not necessarily have the same bytes on disk, as the following session shows.

In [10]: import win32com.client

In [11]: word = win32com.client.Dispatch("Word.Application")

In [12]: doc = word.Documents.Add()

In [13]: doc.SaveAs(FileName = "test.docx")

In [14]: doc.SaveAs(FileName = "test2.docx")

In [15]: import filecmp

In [16]: filecmp.cmp("test.docx","test2.docx")
Out[16]: False
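
A minimal sketch of one way to look past the raw bytes, assuming the files are in the Office Open XML formats (.docx, .xlsx, .pptx), which are ZIP packages. The ZIP container records a timestamp per entry, and docProps/core.xml stores created and modified times, so two saves of the same document may differ at the byte level. The names ooxml_parts, same_ooxml_parts, and IGNORED_PARTS are hypothetical, and the choice of which parts to ignore is an assumption, not something Office documents. This does not apply to the older binary .doc format.

```python
import zipfile

# Assumption: these parts hold save-time metadata rather than visible content.
IGNORED_PARTS = frozenset({"docProps/core.xml", "docProps/app.xml"})


def ooxml_parts(source, ignored=IGNORED_PARTS):
    """Return {part name: raw bytes} for a package given as a path or file object."""
    with zipfile.ZipFile(source) as package:
        return {
            name: package.read(name)
            for name in package.namelist()
            if name not in ignored
        }


def same_ooxml_parts(source_a, source_b, ignored=IGNORED_PARTS):
    """True if both packages hold the same parts with byte-identical data."""
    return ooxml_parts(source_a, ignored) == ooxml_parts(source_b, ignored)
```

Calling same_ooxml_parts("test.docx", "test2.docx") compares the decompressed parts and ignores ZIP-level timestamps. It can still report a difference for documents that look identical, since the application may write other save-specific data into the remaining parts.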

Next, I could try to compare the content of files manually like so:

def compareWordDocs(self, worddoc1_path, worddoc2_path):
    # Assumes self._wordapp is a running Word.Application COM object.
    worddoc1 = self._wordapp.Documents.Open(FileName = worddoc1_path)
    try:
        worddoc2 = self._wordapp.Documents.Open(FileName = worddoc2_path)
        try:
            # Content.Text is the plain text of the main document body only.
            return worddoc1.Content.Text == worddoc2.Content.Text
        finally:
            # Close without saving (0 = wdDoNotSaveChanges), even on error.
            worddoc2.Close(SaveChanges = 0)
    finally:
        worddoc1.Close(SaveChanges = 0)

However, this is also insufficient, since Office documents can contain much more than text. Does Microsoft offer anything like an __eq__() or .equals() method that determines whether two documents have equal content? I need a solution for as many Microsoft Office products as possible, although I realize it may vary from product to product given the nature of the files.

zx81
user3846506
  • What is your definition of *content*? Images? Formatting? Tables? Whitespace? Headers? – Gareth Latty Jul 31 '14 at 19:51
  • Quite honestly, unless you're comparing hundreds of documents, I would use Word's built-in comparison tool. – James Mertz Jul 31 '14 at 19:55
  • Have you tried https://docs.python.org/2/library/difflib.html ? – Julien Palard Aug 01 '14 at 06:09
  • @Lattyware Good question. By content, I mean what the user sees when they open a file in the appropriate program. However, I would be willing to take shortcuts, such as simply verifying there is an image of the same size in the same place in two 'equivalent' office files. – user3846506 Aug 02 '14 at 14:48
  • @KronoS I am comparing hundreds of word documents. I need to verify that content is saved correctly by office programs (through the native SaveAs method) to a server in a test harness, so I need to compare the local content with the remote content. – user3846506 Aug 02 '14 at 14:48
  • @julien-palard Thank you for this suggestion. I have tried this a little, but I am unsure how to verify that two documents are the same with it. Just parse the results making sure no strings begin with '-' or '+'? Also, is it able to handle binary sequences like what would be in .doc files? – user3846506 Aug 02 '14 at 14:51
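
On the difflib question in the last comment, a rough sketch: difflib.unified_diff yields no lines at all when its two inputs are identical, so there is no need to scan for lines beginning with '-' or '+'. The helper name text_differences and the "local"/"remote" labels are hypothetical. The inputs are assumed to be text already extracted from the documents, for example the Content.Text values from the question.

```python
import difflib


def text_differences(text_a, text_b):
    """Return unified-diff lines; an empty list means the texts are identical."""
    return list(
        difflib.unified_diff(
            text_a.splitlines(),  # splitlines() also splits on "\r"
            text_b.splitlines(),
            fromfile="local",
            tofile="remote",
            lineterm="",
        )
    )
```

For a plain yes/no answer, text_a == text_b is enough; the diff is mainly useful for reporting where two documents diverge. For raw bytes, Python 3.5+ provides difflib.diff_bytes, but diffing the bytes of a .doc file would have the same problem as filecmp.cmp.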

0 Answers