I'm using iText to convert XHTML to PDF. After that, I compute an MD5 checksum of the produced PDF so that I store only new or changed files.

Every created file contains a PdfID0 and a PdfID1, which look like hashes.

What are those "hashes" for, and how can I remove them?

I'm using the following code from the iText package to change the metadata:

        import java.io.FileOutputStream;
        import java.util.HashMap;
        import com.lowagie.text.pdf.PdfReader;
        import com.lowagie.text.pdf.PdfStamper;

        PdfReader reader = new PdfReader(pdfPath);
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(tempFile));

        // Start from the existing info dictionary and overwrite selected entries.
        HashMap<String, String> hMap = reader.getInfo();
        hMap.put("Title", "MyTitle");
        hMap.put("Subject", "Subject");
        hMap.put("Keywords", "Key, words, here");
        hMap.put("Creator", "me");
        hMap.put("Author", "me");
        hMap.put("Producer", "me");
        // null values are intended to drop these entries from the output.
        hMap.put("CreationDate", null);
        hMap.put("ModDate", null);
        hMap.put("DocChecksum", null);

        stamper.setMoreInfo(hMap);
        stamper.close();

And this is the file's metadata as extracted with pdftk:

InfoKey: Creator
InfoValue: me
InfoKey: Title
InfoValue: MyTitle
InfoKey: Author
InfoValue: me
InfoKey: Producer
InfoValue: me
InfoKey: Keywords
InfoValue: Key, words, here
InfoKey: Subject
InfoValue: Subject
PdfID0: 28c71a8d7790a4d3e85ce879a90dec0
PdfID1: 4c5865d36c7a381e6166d5e362d0aafc
NumberOfPages: 1

Thanks for any hints.

metar
  • I have the exact same problem with these IDs while generating SHA1 sums. Did you figure out how to strip/normalize this, or did you abort once you knew the info below? – mlissner Jul 09 '13 at 01:13

2 Answers

What you see labelled as PdfID0 and PdfID1 by pdftk's metadata dumping is part of the following PDF trailer code at the end of the respective PDF file (example):

trailer
   << /Size 32
      /Root 24 R
      /Info 19 R
      /ID [ 
            <28c71a8d7790a4d3e85ce879a90dec0>
            <4c5865d36c7a381e6166d5e362d0aafc>
          ]
   >> startxref
81799
%%EOF

The /ID entry in the trailer dictionary is required only if an Encrypt entry is present; otherwise the key is optional.

It is described by the PDF spec as:

"An array of two byte-strings constituting a file identifier (see 14.4, "File Identifiers") for the file. If there is an Encrypt entry this array and the two byte-strings shall be direct objects and shall be unencrypted."

and furthermore:

"The first byte string shall be a permanent identifier based on the contents of the file at the time it was originally created and shall not change when the file is incrementally updated. The second byte string shall be a changing identifier based on the file’s contents at the time it was last updated. When a file is first written, both identifiers shall be set to the same value. If both identifiers match when a file reference is resolved, it is very likely that the correct and unchanged file has been found. If only the first identifier matches, a different version of the correct file has been found."
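The matching rule in that last quote can be sketched in a few lines of Java. This is only an illustration of the logic described by the spec text, not code from iText or any other PDF library; the class name `FileIdMatch`, the enum, and the method `classify` are made up for this example.

```java
import java.util.Arrays;

class FileIdMatch {
    // Possible outcomes when resolving a file reference by its /ID array.
    enum Match { SAME_FILE, DIFFERENT_VERSION, NO_MATCH }

    // "expected*" come from the reference, "actual*" from the /ID array of the file found.
    static Match classify(byte[] expectedPermanent, byte[] expectedChanging,
                          byte[] actualPermanent, byte[] actualChanging) {
        // The first (permanent) identifier must match, otherwise it is another file.
        if (!Arrays.equals(expectedPermanent, actualPermanent)) {
            return Match.NO_MATCH;
        }
        // Same permanent identifier: the second one tells whether the file was updated.
        return Arrays.equals(expectedChanging, actualChanging)
                ? Match.SAME_FILE
                : Match.DIFFERENT_VERSION;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {4, 5, 6};
        System.out.println(classify(a, a, a, a)); // both identifiers match
        System.out.println(classify(a, a, a, b)); // only the first one matches
        System.out.println(classify(a, a, b, b)); // the first one differs
    }
}
```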

And it is NOT necessarily a hash. Here is what the ISO PDF spec suggests (not "prescribes"):

"To help ensure the uniqueness of file identifiers, they should be computed by means of a message digest algorithm such as MD5 (described in Internet RFC 1321, The MD5 Message-Digest Algorithm; see the Bibliography), using the following information:

  • The current time
  • A string representation of the file’s location, usually a pathname
  • The size of the file in bytes
  • The values of all entries in the file’s document information dictionary (see 14.3.3, “Document Information Dictionary”)"
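To make the suggestion above concrete, here is a rough Java sketch of an identifier computed from those four inputs. It is not how iText (or any particular producer) does it; the class `FileIdSketch`, the method `fileId`, and the way the inputs are serialized are assumptions made for illustration. It does show why the identifier differs on every run: the current time is one of the inputs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

class FileIdSketch {
    // Hypothetical helper: MD5 over time, location, size and info values (values must be non-null).
    static String fileId(long timeMillis, String location, long sizeBytes,
                         Map<String, String> info) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        md5.update(Long.toString(timeMillis).getBytes(StandardCharsets.UTF_8));
        md5.update(location.getBytes(StandardCharsets.UTF_8));
        md5.update(Long.toString(sizeBytes).getBytes(StandardCharsets.UTF_8));
        // Sort by key so the result does not depend on map iteration order.
        for (String value : new TreeMap<>(info).values()) {
            md5.update(value.getBytes(StandardCharsets.UTF_8));
        }
        // Render the 16 digest bytes as 32 lowercase hex characters.
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Calling `fileId` twice with the same arguments gives the same string; changing only the time argument gives a different one, which is what happens between two otherwise identical conversion runs.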

There are a few more spots in generated PDF files which may change with each new run. These keys in the document information dictionary (/Info entry referenced in the trailer)

  • /CreationDate
  • /ModDate

may be updated each time you create or modify a PDF.

Therefore, using your own MD5 checksum over the produced PDF to check for new/changed files will not work, unless you make sure you at least 'normalize' the /CreationDate and /ModDate as well as the /ID entries before you create your MD5 hash.
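One way such a normalization could look, using only the Java standard library, is sketched below. It is a deliberately naive illustration: the class `NormalizedPdfHash` and its methods are invented for this example, and the regular expressions only catch an /ID array written as two hex strings and dates written as plain literal strings in uncompressed parts of the file. Encrypted files, compressed object streams, and the XML metadata stream are not handled.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class NormalizedPdfHash {
    // Replace the volatile /ID, /CreationDate and /ModDate values with fixed placeholders.
    static byte[] normalize(byte[] pdf) {
        // ISO-8859-1 maps every byte to one char and back, so binary data survives.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        text = text.replaceAll(
                "/ID\\s*\\[\\s*<[0-9A-Fa-f]*>\\s*<[0-9A-Fa-f]*>\\s*\\]",
                "/ID [<00><00>]");
        text = text.replaceAll(
                "/(CreationDate|ModDate)\\s*\\([^)]*\\)",
                "/$1 ()");
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // MD5 over the normalized bytes, rendered as lowercase hex.
    static String md5Hex(byte[] pdf) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(normalize(pdf));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

With this, two files that differ only in the identifiers and the two dates should hash to the same value, while a change to any other byte still changes the hash. If the producer writes those values in a different form, the patterns have to be adapted.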


Update: As user mkl correctly noted in a comment to this answer, the /CreationDate and /ModDate keys of the /Info dictionary (as well as the /ID info) usually have equivalent pieces of info contained in the XML metadata embedded in the PDF. You can display the complete XML metadata with the help of the pdfinfo utility like so:

pdfinfo -meta your.pdf
Kurt Pfeifle
  • In addition to those info dictionary entries, there may be equivalent entries in the XML metadata streams. – mkl Nov 04 '12 at 09:28
  • @mkl: Thanks for the hint. (I initially skipped that fact in order to keep it simple. But I neglected to consider that this could lead people to construct overly simple solutions which then wouldn't work.) I'll update my answer accordingly. – Kurt Pfeifle Nov 04 '12 at 11:10

Concerning the identifiers... The pdf spec says:

File identifiers shall be defined by the optional ID entry in a PDF file’s trailer dictionary (see 7.5.5, “File Trailer”). The ID entry is optional but should be used. The value of this entry shall be an array of two byte strings. The first byte string shall be a permanent identifier based on the contents of the file at the time it was originally created and shall not change when the file is incrementally updated. The second byte string shall be a changing identifier based on the file’s contents at the time it was last updated. When a file is first written, both identifiers shall be set to the same value. If both identifiers match when a file reference is resolved, it is very likely that the correct and unchanged file has been found. If only the first identifier matches, a different version of the correct file has been found.

Thus, the identifiers are optional but recommended.

iText automatically inserts and updates identifiers. You can of course change iText (it's open source after all) to not do that.

mkl
  • So it's not possible to skip the insertion of the identifiers (e.g. via a property) with the available iText version? I need to modify it myself to do so? – metar Nov 03 '12 at 15:55
  • Yes. Document changes without a change of the id are an unwanted behaviour after all. – mkl Nov 03 '12 at 18:29
  • If you resort to changing iText, please keep in mind that there also are numerous other entries in the pdf which are at least dependent on the time of the last document change, cf. Kurt's answer. – mkl Nov 04 '12 at 09:14