
Please have a look at the issue below.

1 - Apply MD5 to a .txt file containing "Hello" (without quotes, length = 5). It gives some hash value (say h1).
2 - Now the file content is changed to "Hello " (without quotes, length = 6). It gives some hash value (say h2).
3 - Now the file is changed back to "Hello" (exactly as in step 1). The hash is again h1, which makes sense.
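The text-file behaviour is easy to reproduce. A minimal sketch using Node's built-in crypto module, hashing the strings in memory (hashing the file bytes gives the same digests, since a plain .txt file contains nothing but the text):

```typescript
import { createHash } from 'crypto';

// MD5 is a pure function of the input bytes: same bytes, same digest.
const md5 = (data: string): string =>
    createHash('md5').update(data).digest('hex');

const h1 = md5('Hello');   // step 1
const h2 = md5('Hello ');  // step 2: one extra byte, completely different digest
const h3 = md5('Hello');   // step 3: identical bytes again

console.log(h1 === h3); // true
console.log(h1 === h2); // false
```

This is exactly why step 3 reproduces h1: the .txt file round-trips to byte-identical content.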

Now the problem comes when the same procedure is applied to a .pdf file. Here, rather than changing the file content, I am changing the colour of the text and then reverting back to the original file. This way I get three different hash values.

So, is the hash different because of the way the PDF reader encodes the text and metadata, or is the analogy itself wrong?

Info: using a freeware tool on Windows to calculate the hash.

Sudhansu
    Hi, as a wild guess, I would say that the PDF file contains metadata such as the last edit time, so that would be part of the file that makes up the hash value – KevInSol Mar 03 '15 at 07:35
  • Hi KevInSol, generally time information is kept as file metadata, i.e. in separate data structures (the output of the stat command). Here the original file was formatted and then changed back to its original condition, so the metadata relating to the formatting should have been deleted. Is this understanding right? – Sudhansu Mar 03 '15 at 12:59
  • Hi, as I said, it was a wild guess; I don't really know much about PDF, nor do I have the ability to write them. But I've just opened a random file in Adobe Reader v11, and in the document properties (Ctrl+D) it gives created and modified times. I would guess that you are changing the modified time, which must be stored somewhere within the PDF and thus changes the hash, even if the actual text/formatting is reverted to your initial condition. – KevInSol Mar 03 '15 at 13:14
  • On a Word doc: MD5 (polling.doc) = f22784408dc39c4727d58b448daee198; then put in one space, then backspace & save: MD5 (polling.doc) = e84e71698ae2c4431075ae36c6a91dbc – KevInSol Mar 03 '15 at 13:19
  • In that case, both Word and Adobe Reader are storing timing information, which Notepad does not, and that is why the difference is visible. This seems to be the right answer. – Sudhansu Mar 03 '15 at 13:24
  • I assume you want to do this to track whether a document has been changed? A workaround may be to use one of the utilities that extract just the text (such as http://www.foolabs.com/xpdf/download.html) and compare the text output. – KevInSol Mar 03 '15 at 14:05

2 Answers


So, is the hash different because of the way the PDF reader encodes the text and metadata, or is the analogy itself wrong?

Correct. If you need to test this on your own data, open any PDF in a text editor (I use Notepad++) and scroll to the bottom, where the metadata is usually stored. You'll see something akin to:

<</Subject (Shipping Documents)
/CreationDate (D:20150630070941-06'00')
/Title (Shipping Documents)
/Author (SomeAuthor)
/Producer (iText by lowagie.com \(r0.99 - paulo118\))
/ModDate (D:20150630070941-06'00')
>>

Obviously, /CreationDate and /ModDate at the very least will continue to change. Even if you re-generate a PDF from the same source, with identical source data, those timestamps change the checksum of the target PDF.
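One workaround, if you control the hashing step, is to strip those volatile date entries before hashing. A minimal sketch, assuming the dates appear as literal `/CreationDate (D:...)` and `/ModDate (D:...)` entries in the file text (PDFs can also carry dates in embedded XMP metadata, which this does not handle):

```typescript
import { createHash } from 'crypto';

// Hash a PDF's bytes after removing the /CreationDate and /ModDate
// entries, so that timestamp-only changes don't alter the digest.
function hashIgnoringDates(pdfBytes: Buffer): string {
    // latin1 round-trips every byte unchanged, unlike utf8.
    const text = pdfBytes.toString('latin1');
    const stripped = text.replace(/\/(CreationDate|ModDate)\s*\(D:[^)]*\)/g, '');
    return createHash('md5').update(stripped, 'latin1').digest('hex');
}
```

This is a sketch, not a robust parser: it can misfire if a content stream happens to contain bytes that look like a date entry.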

Tor

Correct. PDFs which look exactly the same can have different checksums because of metadata stored in the file, such as /ModDate. I needed to detect PDFs that look the same, so I wrote a somewhat hacky TypeScript function. It isn't guaranteed to work, but it detects duplicates at least some of the time (checksums over the whole file will rarely detect duplicate PDFs). You can read more about the PDF format here: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf and see some similar solutions in this related SO question: Why does repeated bursting of a multi-page PDF into individual pages via pdftk change the md5 checksum of those pages?

    /**
     * The PDF format is weird, and contains various header information and other metadata.
     * Most (all?) actual pdf contents appear between keywords `stream` and `endstream`.
     * So, to ignore metadata, this function just extracts any contents between "stream" and "endstream".
     * This is not guaranteed to find _all_ contents, but it _should_ ignore all metadata.
     * Useful for generating checksums.
     */
    private getRawContent(buffer: Buffer): string {
        const str = buffer.toString();
        // FIXME: If the binary stream itself happens to contain "endstream" or "ModDate", this won't work.
        const streamParts = str.split('endstream').filter(x => !x.includes('ModDate'));
        if (streamParts.length === 0) {
            return str;
        }
        const rawContent: string[] = [];
        for (const streamPart of streamParts) {
            // Ignore everything before the first `stream`
            const streamMatchIndex = streamPart.indexOf('stream');
            if (streamMatchIndex >= 0) {
                const contentStartIndex = streamMatchIndex + 'stream'.length;
                const rawPartContent = streamPart.substring(contentStartIndex);
                rawContent.push(rawPartContent);
            }
        }
        return rawContent.join('\n');
    }
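As a quick sanity check on the idea (a hypothetical, self-contained sketch rather than the method above): hashing only the `stream`…`endstream` contents makes two files that differ only in /ModDate produce the same digest.

```typescript
import { createHash } from 'crypto';

// Standalone sketch: keep only the bytes between `stream` and `endstream`,
// drop any split part mentioning ModDate, and hash what remains.
function streamOnlyChecksum(pdfText: string): string {
    const parts = pdfText
        .split('endstream')
        .filter(part => !part.includes('ModDate'))
        .map(part => {
            const i = part.indexOf('stream');
            return i >= 0 ? part.substring(i + 'stream'.length) : '';
        });
    return createHash('md5').update(parts.join('\n')).digest('hex');
}

// Two synthetic "PDFs" whose only difference is the /ModDate in the trailer:
const a = 'stream\nBT (Hello) Tj ET\nendstream\ntrailer <</ModDate (D:2015)>>';
const b = 'stream\nBT (Hello) Tj ET\nendstream\ntrailer <</ModDate (D:2016)>>';
console.log(streamOnlyChecksum(a) === streamOnlyChecksum(b)); // true
```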
Brian