-3
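import com.google.common.hash.HashCode;
import com.google.common.hash.Hashing;
import com.google.common.io.Files;
import java.io.File;
import java.io.IOException;

// Returns true when both files have the same MD5 hash, i.e. identical raw bytes.
public boolean compareFiles(File newFileInput, File oldFileInput) throws IOException {
    HashCode newFile = Files.asByteSource(newFileInput).hash(Hashing.md5());
    HashCode oldFile = Files.asByteSource(oldFileInput).hash(Hashing.md5());
    System.out.println("HashCode New File : " + newFile + "\nHashCode Old File : " + oldFile);
    return newFile.equals(oldFile);
}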

I used the code above to compute the hash codes of two different docx files in order to compare their content, style, etc. Even though the content and style are the same, the hash codes come out different.

Is there any way to compare docx files for content and style?

backToStack
  • Unrelated: method names in Java should be camelCase, like variable names. And why would you want to return a STRING instead of a boolean from such a method? – GhostCat Oct 20 '20 at 07:13
  • And note that your request asking for APIs (tools, libraries) is explicitly off topic here. – GhostCat Oct 20 '20 at 07:17
  • Check this: https://docs.aspose.com/words/java/how-to-compare-two-word-documents/ – Ankit Beniwal Oct 20 '20 at 07:22
  • @GhostCat The aim is to compare two docx files, that's all I want. The method can return a boolean or a string as required. And how is asking about an API off topic? I have mentioned what I am looking for, what I tried and what I need help with... – backToStack Oct 20 '20 at 07:28
  • @ankitbeniwal Checked. It compares the content fine, but I need a comparison of content and style as they are. Also, I am not interested in the differences; I just need MATCH or NO MATCH – backToStack Oct 20 '20 at 07:39
  • This is the fourth time you posted this question: [first](https://stackoverflow.com/questions/64383310/compare-two-documents-in-java/64383353), [second](https://stackoverflow.com/questions/64425384/issue-with-word-document-comparison-using-hashcode) and [third](https://stackoverflow.com/questions/64438979/need-java-implementation-of-docx-to-xml-conversion). At the very least you should link to previous attempts so that people understand what you already tried. On the last one I posted a suggestion in the comments that you seem to have ignored. Try to find that suggestion. – Joachim Sauer Oct 20 '20 at 08:17
  • Correction: my suggestion was in the one linked as "second". Basically: iterate over each ZIP entry using a `ZipInputStream` and hash the individual entries. If the individual file contents are identical (as indicated in the previous post), then the hashes should match. – Joachim Sauer Oct 20 '20 at 08:25
  • @JoachimSauer Thank you. Yes, all the individual XML files inside the zip have the same hash code; I was looking for a straightforward way that is not lengthy. I will try out the way you have suggested. :) – backToStack Oct 20 '20 at 08:43
  • @FoggyDay That's right, because it was deleted. Also, you answered on the first one, where you answered "challenges" and I replied. And yes, Apache POI is not the solution, because I don't just have to read the content of the file; that I can do without any external library. – backToStack Oct 21 '20 at 05:10
  • Reposting is plain wrong. Not appreciated at all. – GhostCat Oct 21 '20 at 09:32
  • @GhostCat Agreed, but when you desperately need an answer, have very little time, and your question keeps getting deleted, things happen. I will take care of it. Also, I haven't reposted it now; I have written an answer – backToStack Oct 21 '20 at 09:35
  • When you are in a hurry, slow down. Questions get deleted for a reason. What makes you think that YOUR priorities are more important than the rules and practices of this community? – GhostCat Oct 21 '20 at 09:42
  • I have no intention of violating any rule. It said the question had been deleted and to repost it with clear details, and that is what I did. I have only recently started using this platform regularly, and as I said, I will take care of these rules going forward. – backToStack Oct 21 '20 at 09:48

2 Answers

3

Determining whether two files of a complex type are "equalish" is always very tricky, if not impossible. A DOCX file contains much more than just some text and whether it is bold or not.

There are ways to make documents look exactly the same with different properties, and there is also a lot of metadata saved in the file (the author, for example). So it is not just a technical problem; it is more of a philosophical one. Let me give you an example:

You are asked to compare two cars and say whether they are the same or not. There are some obvious cases where they are objectively different, like a heavy lorry and a small city EV. But what if they are of the same type, but of a different color? Or the same type and color, but with a different amount of fuel in the tank?

The same goes for DOCX. Same text, but different colors? Same content, but different authors? Same … but different …?

Maybe you can disclose some more information about what you are trying to achieve, otherwise I doubt we can help you more.

If you really need to compare two DOCX files (or files of any other type of similar complexity), find a library that can parse them and build the comparison logic yourself. However, one might spend years doing so without a satisfactory result.

If you are more into dirty hacky solutions, use a library to build an image of the pages of the document and compare them as images. This will ensure the pages look the same. However, based on your definition of equality, that doesn't have to mean they are the same documents.
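The comparison half of that approach could look roughly like the sketch below. It assumes some library has already rendered each page to a `java.awt.image.BufferedImage` (the rendering itself is not shown), and `PageImageCompare` and `samePixels` are hypothetical names chosen for illustration.

```java
import java.awt.image.BufferedImage;

final class PageImageCompare {

    // True only if both images have the same dimensions and every pixel has the same ARGB value.
    static boolean samePixels(BufferedImage first, BufferedImage second) {
        if (first.getWidth() != second.getWidth() || first.getHeight() != second.getHeight()) {
            return false;
        }
        for (int y = 0; y < first.getHeight(); y++) {
            for (int x = 0; x < first.getWidth(); x++) {
                if (first.getRGB(x, y) != second.getRGB(x, y)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-ins for two rendered pages; a real test would obtain these from a renderer.
        BufferedImage pageA = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        BufferedImage pageB = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        System.out.println(samePixels(pageA, pageB) ? "MATCH" : "NO MATCH");
    }
}
```

An exact pixel match is a strict criterion: differences in anti-aliasing or installed fonts between two renderings would count as a mismatch, so this only makes sense when both documents are rendered in the same environment.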

If you can choose another file format, it might be a good idea to do so. However, there will still be some tricky parts. Even Markdown (the language we use to format Q&A on this site) cannot be compared byte for byte.

This
**weird**
post

will render the same as

This **weird** post

into

This weird post.

Vojtech Kane
  • That was a wonderful explanation. To answer a few of your doubts: I am working on a file comparison framework that takes two files (old, new) generated at different times. When run, the test passes only if both files are the same. By "same" I mean the same content/data/sentences, including white space, and also the same styling, such as fonts, font sizes, etc. Everything should match in both files – backToStack Oct 20 '20 at 08:53
  • I have done the comparison for .TXT files using MD5, and it works fine, since a txt file contains just plain text without any formatting or styling. But hash code comparison won't work for docx, since it has other things that can mismatch apart from the content, as you explained in your answer, and so I am looking for a library that would do the job. – backToStack Oct 20 '20 at 08:54
0

There were multiple suggestions and answers to my question, thanks for that. The reason for the mismatch between docx files lies in the metadata: every time we create a doc/docx file, the timestamp changes. I tried changing the timestamps (accessed, modified and created) of both files to make them the same before comparing, but that didn't work. The reason is that, apart from these timestamps, there is a piece of metadata called Zip Modify Date, which isn't visible in the file properties. I found this timestamp to be one of the reasons the hash codes mismatched. The base64-encoded strings also differed because of the zip timestamp.
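The effect of a zip entry timestamp can be reproduced with the JDK alone. The following is a minimal sketch, not the code used for the files above: the class name `ZipTimestampDemo`, the helper names, the entry name and the two timestamps are made up for illustration. It builds two in-memory archives with identical entry content but different entry modification times and compares the MD5 of the whole archives.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ZipTimestampDemo {

    // Builds a one-entry ZIP in memory; entryTimeMillis is stored as the entry's modification time.
    static byte[] zipWithTime(String name, String content, long entryTimeMillis) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            ZipEntry entry = new ZipEntry(name);
            entry.setTime(entryTimeMillis);
            zip.putNextEntry(entry);
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // MD5 over the raw bytes, the same kind of whole-file hash the question uses.
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:document>same content</w:document>";
        byte[] older = zipWithTime("word/document.xml", xml, 1_600_000_000_000L);
        byte[] newer = zipWithTime("word/document.xml", xml, 1_600_000_100_000L);
        System.out.println("whole-archive hashes equal: "
                + Arrays.equals(md5(older), md5(newer)));
    }
}
```

Because the entry time is written into the archive bytes, the two archives hash differently even though the entry content is byte-for-byte the same.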

So, the options I had for the comparison were:

  1. Convert the docx file to an XML file.
  2. Treat the docx file as a zip archive, unzip it and iterate through all the XML files,
    computing the hash code of each and comparing the hash codes (as suggested in one of the answers).

Option 2 was good, but it required a lot of iteration, and unzipping would create a lot of folders.
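For reference, the per-entry hashing in option 2 does not have to extract anything to disk: a `ZipInputStream` can read each entry in memory. The following is a minimal sketch using only the JDK, not the code that was actually used here; `DocxEntryHashes`, `entryHashes` and `sameContent` are hypothetical names, and the two file paths are assumed to come from the command line.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class DocxEntryHashes {

    // Maps each entry name in the archive to the hex MD5 of its uncompressed bytes.
    static Map<String, String> entryHashes(InputStream docx) throws IOException {
        Map<String, String> hashes = new TreeMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            byte[] chunk = new byte[8192];
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                if (entry.isDirectory()) {
                    continue;
                }
                MessageDigest md5 = newMd5();
                // read() returns -1 at the end of the current entry, not of the whole archive.
                for (int n = zip.read(chunk); n != -1; n = zip.read(chunk)) {
                    md5.update(chunk, 0, n);
                }
                hashes.put(entry.getName(), toHex(md5.digest()));
            }
        }
        return hashes;
    }

    // Two archives match when they contain the same entry names with the same content hashes.
    static boolean sameContent(InputStream oldDocx, InputStream newDocx) throws IOException {
        return entryHashes(oldDocx).equals(entryHashes(newDocx));
    }

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    private static String toHex(byte[] digest) {
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws IOException {
        try (InputStream oldDoc = Files.newInputStream(Paths.get(args[0]));
             InputStream newDoc = Files.newInputStream(Paths.get(args[1]))) {
            System.out.println(sameContent(oldDoc, newDoc) ? "MATCH" : "NO MATCH");
        }
    }
}
```

Only entry names and uncompressed content are compared, so entry timestamps in the zip container are ignored. If an entry inside the archive itself stores dates (document properties such as `docProps/core.xml` typically do), it would still cause a mismatch and might have to be skipped.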

Option 1 was straightforward: I tried it using an external library, docx4j, which converted the docx to XML, and then the hash codes matched. It worked.

Convert DOCX to XML file

I had to try different options, since I was looking for the simplest, least complex way to compare the content and styles of Word documents.

backToStack