
I am using FileUtils to compare two identical pdfs. This is the code:

boolean comparison = FileUtils.contentEquals(pdfFile1, pdfFile2);

Despite the fact that both pdf files are identical, I keep getting false. I also noticed that when I execute:

byte[] byteArray = FileUtils.readFileToByteArray(pdfFile1);
byte[] byteArrayTwo = FileUtils.readFileToByteArray(pdfFile2);
System.out.println(byteArray);
System.out.println(byteArrayTwo);

I get the following bytecode for the two pdf files:

[B@3a56f631
[B@233d28e3

So even though both pdf files are absolutely identical visually, their byte-code is different and hence failing the boolean test. Is there any way to test whether the identical pdf files are identical?

blackpanther
  • Firstly, by comparing the `toString` outputs you can essentially only see that you have different `byte[]` **instances**; you don't get any information about their **contents**, cf. @Sleeper9's answer. That being said, PDF files that are *absolutely identical visually* may be constructed in many different ways. Thus, comparing them byte by byte won't help in general (unless you are searching for duplicates created by file copies). – mkl Apr 28 '14 at 10:02
  • I used PdfBox to convert them into JPEGs and then compare them using FileUtils. That worked for me. – blackpanther Apr 28 '14 at 10:16
  • If by *pdf files absolutely identical visually* you meant files that appear identical to the human eye, there may still be minute differences in the rendered information. Thus, even after rendering to a bitmap image format, you'll need some specialized image comparison software (which can tell you that the differences are minute). (And by the way, bitmap image formats may also contain meta-information like creation time or source document name...) – mkl Apr 28 '14 at 10:20
  • For comparing the visual identity, rasterizing is the way to go. However, I would strongly advise against using a lossy image file format (such as JPEG), but use a lossless format. Otherwise, rendering artefacts can cause a non-existent difference. – Max Wyss Apr 28 '14 at 11:47
  • Another note when rasterizing: again, it is strongly recommended to take into account overprinting and Output Intents, because otherwise, you may get different appearances of exactly the same document. – Max Wyss Apr 28 '14 at 11:48

3 Answers


Yes, generate an MD5 sum for each file and check whether the sums are identical.

If they are, your files are identical too, with a certainty that is practically 100%.

If the sums are not identical, your files are different for sure.

To generate the MD5 sums, there's the md5sum command on Linux; for Windows there's a small tool called fciv:

http://www.microsoft.com/en-us/download/details.aspx?id=11533
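Since the question is about Java, the same check can also be done in code instead of with a command-line tool. The following is only a minimal sketch using the JDK's `MessageDigest`; the class name `Md5Compare` and its helper methods are made up for illustration, and reading the whole file into memory assumes the PDFs are reasonably small.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Compare {

    // Returns the MD5 digest of the given bytes as a lowercase hex string.
    static String md5Hex(byte[] data) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to support MD5.
            throw new IllegalStateException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Convenience overload: reads the whole file into memory first.
    static String md5Hex(Path file) throws IOException {
        return md5Hex(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Md5Compare <file1> <file2>");
            return;
        }
        String first = md5Hex(Paths.get(args[0]));
        String second = md5Hex(Paths.get(args[1]));
        System.out.println(first);
        System.out.println(second);
        System.out.println(first.equals(second) ? "same bytes" : "different bytes");
    }
}
```

Note that this answers the same question as `FileUtils.contentEquals`: both report whether the files are byte-for-byte equal, not whether they look the same.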

peter.petrov

Just to note, the two identifiers you wrote

[B@3a56f631
[B@233d28e3

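are different because they belong to two different objects. They are not bytecode: each is the default `toString()` output of an array, that is, the type (`[B` means `byte[]`) followed by the object's identity hash code in hexadecimal. Two objects can be logically equal even if they are not the same object (and therefore print different identifiers).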
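To make the difference between identity and content concrete, here is a small sketch. The class name `ByteArrayCompare` and the sample bytes are invented for the example; `Arrays.equals` is the standard JDK call for comparing array contents.

```java
import java.util.Arrays;

public class ByteArrayCompare {

    // True when both arrays have the same length and the same bytes in order.
    static boolean sameContent(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        // Two separate arrays holding the same four bytes ("%PDF").
        byte[] first = {0x25, 0x50, 0x44, 0x46};
        byte[] second = {0x25, 0x50, 0x44, 0x46};

        // Prints type plus identity hash (like [B@3a56f631), not the contents.
        System.out.println(first);
        System.out.println(second);

        System.out.println(first == second);            // false: different objects
        System.out.println(first.equals(second));       // false: arrays inherit Object.equals
        System.out.println(sameContent(first, second)); // true: same contents
    }
}
```

So printing the arrays cannot show whether the PDFs differ; only a content comparison can.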

Otherwise, calculating an MD5 checksum as peter.petrov wrote is a good idea.

Sleeper9

Unfortunately for PDF there is a big difference between having "identical files" and having files that are "visually identical". So the first question is what you are looking for.

One very simple example: information in a PDF file can be compressed or not, and can be compressed with different compression filters. Taking a file where some of the content is not compressed and compressing that content with, for example, a ZIP compression filter would give you two files that are very different on a byte level, yet very much the same visually.
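The effect can be illustrated without any PDF library. The sketch below uses the JDK's `Deflater` and `Inflater` (zlib/deflate, the kind of compression meant by "ZIP" here) on a short string that merely stands in for page content; the class name `CompressionDemo` and the sample string are assumptions made for the illustration, not a real PDF workflow.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressionDemo {

    // Compresses the input with zlib/deflate.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverses deflate(); rejects data that is not valid zlib/deflate.
    static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input or preset dictionary required
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Stand-in for an uncompressed page content stream.
        byte[] raw = "BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] compressed = deflate(raw);

        // The stored bytes differ ...
        System.out.println(Arrays.equals(raw, compressed));          // false
        // ... but the decoded content is the same.
        System.out.println(Arrays.equals(raw, inflate(compressed))); // true
    }
}
```

A byte-level comparison or checksum sees the first result; a viewer rendering the page effectively sees the second.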

So you can do a number of different things to compare PDF files:

1) If you want to check whether you have "the same file", read them in and calculate some sort of checksum as answered before by Peter Petrov.

2) If you want to know whether or not the files are visually identical, the most common method is some kind of rendering: render all pages to images and compare the images. In practice this is not as simple as it sounds, and there are both simple (for example callas pdfToolbox) and complex (for example Global Vision DigitalPage) applications that implement some kind of "sameness" algorithm (caution, I'm related to both of those vendors).
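For the second approach, the comparison step might look roughly like the sketch below. It assumes each page has already been rendered to a `BufferedImage` by a PDF library of your choice (the rendering itself is not shown), and it is far cruder than the commercial tools mentioned: it only counts pixels whose colour channels differ by more than a tolerance. The class name `PageImageCompare` and the tolerance value are illustrative assumptions.

```java
import java.awt.image.BufferedImage;

public class PageImageCompare {

    // Counts pixels where the red, green or blue channel differs by more than tolerance.
    static long countDifferingPixels(BufferedImage a, BufferedImage b, int tolerance) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("page images differ in size");
        }
        long differing = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                int p = a.getRGB(x, y);
                int q = b.getRGB(x, y);
                if (channelDiff(p, q, 16) > tolerance
                        || channelDiff(p, q, 8) > tolerance
                        || channelDiff(p, q, 0) > tolerance) {
                    differing++;
                }
            }
        }
        return differing;
    }

    // Absolute difference of one 8-bit channel, selected by its bit offset.
    private static int channelDiff(int p, int q, int shift) {
        return Math.abs(((p >> shift) & 0xFF) - ((q >> shift) & 0xFF));
    }

    public static void main(String[] args) {
        // In-memory stand-ins for two rendered pages.
        BufferedImage pageA = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        BufferedImage pageB = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        pageB.setRGB(10, 10, 0xFFFFFF); // one clearly different pixel
        pageB.setRGB(20, 20, 0x020202); // a minute difference, within tolerance

        System.out.println(countDifferingPixels(pageA, pageB, 5)); // 1
    }
}
```

A tolerance above zero is what lets minute rendering differences pass; how large it should be, and how many differing pixels are acceptable, depends on what "visually identical" needs to mean for you.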

So define very well what exactly you need first, then choose carefully which approach would work best.

David van Driessche