I am looking for the easiest way to check whether two Office files have the same content using Python. My first instinct was to use filecmp.cmp, but this approach fails: two files with the same content do not necessarily contain the same bytes, as the session below shows.
In [10]: import win32com.client
In [11]: word = win32com.client.Dispatch("Word.Application")
In [12]: doc = word.Documents.Add()
In [13]: doc.SaveAs(FileName = "test.docx")
In [14]: doc.SaveAs(FileName = "test2.docx")
In [15]: import filecmp
In [16]: filecmp.cmp("test.docx","test2.docx")
Out[16]: False
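To narrow down where the two files actually differ, here is a minimal sketch. It assumes the files are Office Open XML packages (.docx, .xlsx, .pptx), which are ZIP archives of parts, so it will not work on the legacy binary formats (.doc, .xls, .ppt). The helper names part_digests and differing_parts are my own, not part of any Office or Python API.

```python
import hashlib
import zipfile


def part_digests(path):
    """Map each part name in the package to the SHA-256 of its uncompressed bytes."""
    with zipfile.ZipFile(path) as package:
        return {
            info.filename: hashlib.sha256(package.read(info.filename)).hexdigest()
            for info in package.infolist()
            if not info.is_dir()
        }


def differing_parts(path_a, path_b):
    """Return sorted names of parts that are missing from one file or differ in content."""
    a = part_digests(path_a)
    b = part_digests(path_b)
    # A part present in only one package yields None on the other side, so it is reported too.
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list means every part is byte-identical once decompressed, so any remaining difference lies only in the ZIP container itself (compression, entry order, or entry timestamps). A non-empty list names the parts worth inspecting. This is still a byte-level comparison of each part, not a semantic one, so it does not tell me whether two documents are "equal" in the sense I am asking about.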
Next, I could try to compare the content of files manually like so:
def compareWordDocs(self, worddoc1_path, worddoc2_path):
    worddoc1 = self._wordapp.Documents.Open(FileName = worddoc1_path)
    worddoc2 = self._wordapp.Documents.Open(FileName = worddoc2_path)
    worddoc1_content_text = worddoc1.Content.Text
    worddoc2_content_text = worddoc2.Content.Text
    worddoc1.Close(SaveChanges = 0)
    worddoc2.Close(SaveChanges = 0)
    return worddoc1_content_text == worddoc2_content_text
However, this can also be a problem, since Office documents can contain many things besides text. Does Microsoft offer anything like an __eq__() or .equals() function that can determine whether the content of two documents is equal? I need a solution for as many Microsoft Office products as possible, although I realize that it may vary from product to product given the nature of the files.