
I want to manually verify the integrity of a signed PDF. So far I have done the following:

  • got the value of '/Content' node from pdf(using PyPDF2). This is a der encoded PKCS#7 certificate.

Now, as per the PDF specification, the message digest of the PDF data is stored along with the certificate in the /Content node. I have tried a lot, but I am not able to get the digest value, which I would eventually compare with the hash of the PDF content (as specified by /ByteRange).
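The comparison side of this check, hashing the bytes named by /ByteRange, can be sketched with the standard library alone. This is a minimal sketch under assumptions: the helper names `find_byte_range` and `digest_signed_ranges` are hypothetical, the regular expression simply takes the first /ByteRange array in the raw file (a PDF with several signatures has several), and the hash algorithm passed in must match the digest algorithm named in the signature container.

```python
import hashlib
import re


def find_byte_range(pdf: bytes) -> tuple:
    """Return the four integers of the first /ByteRange array in the raw file."""
    match = re.search(
        rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]", pdf
    )
    if match is None:
        raise ValueError("no /ByteRange entry found")
    return tuple(int(group) for group in match.groups())


def digest_signed_ranges(pdf: bytes, byte_range, algorithm: str = "sha256") -> bytes:
    """Hash the concatenation of the (offset, length) pairs in byte_range."""
    hasher = hashlib.new(algorithm)
    # byte_range is [offset1, length1, offset2, length2]
    for offset, length in zip(byte_range[0::2], byte_range[1::2]):
        hasher.update(pdf[offset:offset + length])
    return hasher.digest()
```

Typical use would be to read the whole file as bytes, call `find_byte_range` on it, and pass the result to `digest_signed_ranges`; the returned bytes are what the digest stored in the signature has to equal.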

  • PDF specification snapshot:- snap

I don't understand the last part, which says to write the signature object data into the dictionary. Where does this write actually happen, and how can I extract the message digest?

Barun Sharma
  • @BrunoLowagie...sorry. will edit the question. This is der encoded PKCS#7. – Barun Sharma Feb 09 '15 at 11:13
  • *got the value of '/Content' node from pdf(using PyPDF2). This is a der encoded PKCS#7 certificate* - **A** It should be a PKCS#7 / CMS *signature container* (as long as the **SubFilter** is **ETSI.CAdES.detached**, **adbe.pkcs7.detached**, or **adbe.pkcs7.sha1**) which may *contain* certificates as an additional payload; the more important part is the SignerInfo, though... **B** While according to spec this signature object shall be DER encoded, there are numerous PDFs in the wild with signature objects whose outer layers merely are BER encoded. – mkl Feb 09 '15 at 13:36
  • @mkl Thanks for the inputs. Yes here the /Subfilter is `adbe.pkcs7.detached`. Now the reason for my statement is:- **A)**I got the contents of '/Content' node and wrote it to a file-->used `openssl pkcs7` command line utility to convert this to pem which is readable. I am pretty new to using crypto and digital certs. Can you help me figuring out how I can extract this message digest from the certificate and use this for my manual check? **B)**outer layed->do you mean the container? – Barun Sharma Feb 10 '15 at 03:50
  • *Can you help me figuring out* - Unfortunately I hardly ever use openssl and I don't have any python experience at all. All I can try to do is explain the lowlevel objects. *how I can extract this message digest from the certificate and use this for my manual check* - But you don't want to extract a message digest from a **certificate**. You want to extract it from a **signature container**. *outer layed->do you mean the container* - Yes. But most likely you won't perceive the different encodings anyway, most tools handle both. – mkl Feb 10 '15 at 08:48
  • @mkl Ok Thanks. That makes sense. Infact I am also carving for some lowlevel details. You are right, I need to extract the digest from signature container(Actually I got confused reading the pdf specification). Anyways, can you help me figuring out where is this signature container and how do I parse it to get the message digest. Till now I thought that the message digest is embedded somewhere in the digital signature(as I understood from pdf specifications). – Barun Sharma Feb 10 '15 at 09:07
  • @mkl I have made some edit to question. Please check. – Barun Sharma Feb 10 '15 at 09:36
  • *PDF specification snapshot* - that is not from the PDF specification. The currently normative PDF specification is ISO 32000-1:2008 (a part 2, i.e. ISO 32000-2, is being worked on), a copy of which is provided by Adobe [here](http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf) *for those that do not need the official version containing the ISO logo and copyright notices.* Your snapshot looks like being taken from some document showing Adobe-specific stuff. – mkl Feb 10 '15 at 09:42
  • @mkl A real good pointer. Yes I have some progress in my understanding. Thanks for bearing my stupidity if any. Now in this document says:- *1)* `/Content`-->The signature value. When ByteRange is present, the value shall be a hexadecimal string (see 7.3.4.3, “Hexadecimal Strings”) representing the value of the byte range digest. For public-key signatures, Contents should be either a DER-encoded PKCS#1 binary data object or a DER-encoded PKCS#7 binary data object. – Barun Sharma Feb 10 '15 at 10:00
  • @mkl Also, *2)* `/Cert`--> If SubFilter is adbe.pkcs7.detached or adbe.pkcs7.sha1, this entry shall not be used, and the certificate chain shall be put in the PKCS#7 envelope in Contents. Combining the two, I would say `/Contents` will have the PKCS#7 envelope for certificate chain. But then where does the Message Digest go? :( Am I missing some link? – Barun Sharma Feb 10 '15 at 10:02
  • How did you get the value of '/Content' node from pdf using PyPDF2? – Shahbaz Khan Dec 29 '20 at 14:30

1 Answer


(This is more a comment than an answer. Due to the size and formatting restrictions of comments, I put it into an answer nonetheless.)

A signature in a PDF

In a prior question the OP already inserted a sketch illustrating a signature embedded in a PDF in case of SubFilter ETSI.CAdES.detached, adbe.pkcs7.detached, or adbe.pkcs7.sha1:

Figure 3 Digital ID and a signed PDF document

But this is merely a sketch, and interpreting it too literally may leave the incorrect impression that the value of the Contents entry in the signature dictionary is something like a list containing a "Certificate", a "Signed message digest" and a "Timestamp". Furthermore calling this list the "Signature value" can also confuse as that name is also used for a small part of the content, see below.

The actual content is specified (cf. this document) as:

When PKCS#7 signatures are used, the value of Contents shall be a DER-encoded PKCS#7 binary data object containing the signature. The PKCS#7 object shall conform to RFC3852 Cryptographic Message Syntax.

(As an aside: While the specification here requires the data object to be DER-encoded, there are many signed PDFs in the wild which use some much less strict BER-encoding for the object as a whole and DER only for parts also required by RFC3852 to be DER-encoded.)
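To look at the object yourself, the Contents value can be taken straight from the raw file: it is the hex string sitting in the gap between the two signed byte ranges. The following is a rough sketch, assuming a four-element ByteRange; the function names are hypothetical. Signing tools commonly reserve more space than the container needs and pad the hex string with zeros, so the second helper trims to the outer object's declared length when that length is definite. An indefinite outer length tells you the outer layer is BER only; a definite one does not prove the whole object is DER.

```python
def contents_from_gap(pdf: bytes, byte_range) -> bytes:
    """Decode the hex string lying between the two signed byte ranges."""
    gap = pdf[byte_range[0] + byte_range[1]:byte_range[2]].strip()
    if not (gap.startswith(b"<") and gap.endswith(b">")):
        raise ValueError("gap between the byte ranges is not a hex string")
    # bytes.fromhex rejects an odd number of hex digits; not handled here
    return bytes.fromhex(gap[1:-1].decode("ascii"))


def trim_to_outer_object(blob: bytes):
    """Return (object_bytes, length_form) for the outermost TLV in blob."""
    first_length_byte = blob[1]
    if first_length_byte == 0x80:
        # Indefinite length: legal in BER, forbidden in DER; cannot trim
        return blob, "indefinite"
    if first_length_byte < 0x80:
        total = 2 + first_length_byte          # short form
    else:
        count = first_length_byte & 0x7F       # long form: count length bytes
        total = 2 + count + int.from_bytes(blob[2:2 + count], "big")
    return blob[:total], "definite"
```

The trimmed bytes can be written to a file and inspected with any ASN.1 viewer.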

The PKCS#7 binary data object

The PKCS#7 binary data object containing the signature conforming to RFC3852 more exactly is a ContentInfo object with a SignedData content, often named a "signature container".

According to RFC 3852

The CMS associates a content type identifier with a content. The syntax MUST have ASN.1 type ContentInfo:

  ContentInfo ::= SEQUENCE {
    contentType ContentType,
    content [0] EXPLICIT ANY DEFINED BY contentType }
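As a quick sanity check that an extracted blob really is such a ContentInfo carrying SignedData, one can read the leading contentType OID by hand. This is only a sketch: `read_header`, `decode_oid` and `content_type` are hypothetical helpers, single-byte tags are assumed, and an indefinite outer length is tolerated because only the first child is read. The signed-data OID is 1.2.840.113549.1.7.2.

```python
SIGNED_DATA_OID = "1.2.840.113549.1.7.2"


def read_header(buf: bytes, pos: int):
    """Return (tag, value_start, length); length is None for indefinite form."""
    tag, first = buf[pos], buf[pos + 1]
    pos += 2
    if first < 0x80:
        return tag, pos, first
    count = first & 0x7F
    if count == 0:
        return tag, pos, None
    return tag, pos + count, int.from_bytes(buf[pos:pos + count], "big")


def decode_oid(body: bytes) -> str:
    """Turn the value bytes of an OBJECT IDENTIFIER into dotted notation."""
    values, accumulator = [], 0
    for byte in body:
        accumulator = (accumulator << 7) | (byte & 0x7F)
        if not byte & 0x80:            # high bit clear ends a sub-identifier
            values.append(accumulator)
            accumulator = 0
    first_arc = min(values[0] // 40, 2)
    arcs = [first_arc, values[0] - 40 * first_arc] + values[1:]
    return ".".join(str(arc) for arc in arcs)


def content_type(container: bytes) -> str:
    """Return the contentType OID of a ContentInfo as a dotted string."""
    tag, start, _ = read_header(container, 0)
    if tag != 0x30:
        raise ValueError("ContentInfo must start with a SEQUENCE")
    oid_tag, oid_start, oid_length = read_header(container, start)
    if oid_tag != 0x06 or oid_length is None:
        raise ValueError("first element is not an OBJECT IDENTIFIER")
    return decode_oid(container[oid_start:oid_start + oid_length])
```

If `content_type(blob) == SIGNED_DATA_OID` holds, the content that follows is the SignedData structure.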

The signed-data content type shall have ASN.1 type SignedData:

  SignedData ::= SEQUENCE {
    version CMSVersion,
    digestAlgorithms DigestAlgorithmIdentifiers,
    encapContentInfo EncapsulatedContentInfo,
    certificates [0] IMPLICIT CertificateSet OPTIONAL,
    crls [1] IMPLICIT RevocationInfoChoices OPTIONAL,
    signerInfos SignerInfos }

Here you see the optional collection certificates in which usually at least the signer certificate and often also its chain of issuer certificates are contained. Here is the "Certificate" from the sketch above.

You also see the signerInfos structure which contains actual signing information:

  SignerInfos ::= SET OF SignerInfo

Per-signer information is represented in the type SignerInfo:

  SignerInfo ::= SEQUENCE {
    version CMSVersion,
    sid SignerIdentifier,
    digestAlgorithm DigestAlgorithmIdentifier,
    signedAttrs [0] IMPLICIT SignedAttributes OPTIONAL,
    signatureAlgorithm SignatureAlgorithmIdentifier,
    signature SignatureValue,
    unsignedAttrs [1] IMPLICIT UnsignedAttributes OPTIONAL }
  SignedAttributes ::= SET SIZE (1..MAX) OF Attribute
  Attribute ::= SEQUENCE {
    attrType OBJECT IDENTIFIER,
    attrValues SET OF AttributeValue }

(Here you see the structure the RFCs call the SignatureValue. As already mentioned, the sketch above calling the whole signature container the "Signature value" can confuse, because down here there already is an entity whose type has that name.)

You are after the message digest of the signed PDF byte ranges for an adbe.pkcs7.detached type PDF signature. There actually are two possibilities:

  • In the rare case of the most simple SignerInfo instances, there are no SignedAttributes. In this case the SignatureValue is the value of a signature algorithm immediately applied to the signed byte ranges.

If the signature algorithm is based on RSA, you can retrieve the document digest value by decoding the value using the signer's public key (from his certificate) and extracting the digest from the decoded DigestInfo object.

    DigestInfo ::= SEQUENCE {
      digestAlgorithm DigestAlgorithmIdentifier,
      digest Digest }

If the signature algorithm is based on DSA or ECDSA, you cannot retrieve the digest value at all. These algorithms only allow you to check whether a digest value you provide (e.g. obtained by hashing the signed byte ranges of the document as you have retrieved it) is the originally signed one.

  • In the far more common case of SignerInfo instances with SignedAttributes, you have to search these SignedAttributes for the message digest attribute which is identified by

    id-messageDigest OBJECT IDENTIFIER ::= { iso(1) member-body(2)
        us(840) rsadsi(113549) pkcs(1) pkcs9(9) 4 }
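For the first, rarer case (RSA with PKCS#1 v1.5 padding and no SignedAttributes), the decoding step can be sketched with plain integer arithmetic. This is an illustration under assumptions, not a verified implementation: the function names are hypothetical, the modulus and public exponent are assumed to have been read from the signer certificate by other means, and only definite-length DER is handled.

```python
def read_tlv(buf: bytes, pos: int):
    """Return (tag, value_start, value_end) of the definite-length TLV at pos."""
    tag, first = buf[pos], buf[pos + 1]
    pos += 2
    if first < 0x80:
        return tag, pos, pos + first
    count = first & 0x7F
    if count == 0:
        raise ValueError("indefinite length is not handled in this sketch")
    length = int.from_bytes(buf[pos:pos + count], "big")
    return tag, pos + count, pos + count + length


def recover_digest_info(signature: bytes, modulus: int, exponent: int) -> bytes:
    """Apply the RSA public key and strip the 00 01 FF..FF 00 padding."""
    size = (modulus.bit_length() + 7) // 8
    number = pow(int.from_bytes(signature, "big"), exponent, modulus)
    encoded = number.to_bytes(size, "big")
    separator = encoded.find(b"\x00", 2)
    padding = encoded[2:separator]
    if (encoded[:2] != b"\x00\x01" or separator < 0
            or len(padding) < 8 or padding.strip(b"\xff")):
        raise ValueError("not a PKCS#1 v1.5 signature block")
    return encoded[separator + 1:]


def digest_from_digest_info(digest_info: bytes) -> bytes:
    """Return the digest OCTET STRING value from a DER DigestInfo."""
    tag, start, _ = read_tlv(digest_info, 0)               # DigestInfo SEQUENCE
    _, _, algorithm_end = read_tlv(digest_info, start)      # digestAlgorithm
    digest_tag, digest_start, digest_end = read_tlv(digest_info, algorithm_end)
    if tag != 0x30 or digest_tag != 0x04:
        raise ValueError("unexpected DigestInfo layout")
    return digest_info[digest_start:digest_end]
```

Calling `digest_from_digest_info(recover_digest_info(signature, n, e))` yields the digest that was signed, which in this first case is the digest of the signed byte ranges.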

As already mentioned in comments, though, I cannot explain how to drill down here using Python or openssl. You will need some tool which knows these specific ASN.1 structures or ASN.1 structures in general.
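For readers who want to try the second case without a dedicated ASN.1 library, here is a rough standard-library sketch that walks the DER structure and returns the value of the first messageDigest attribute it meets. It is a sketch under several assumptions: `read_tlv`, `children` and `find_message_digest` are hypothetical helpers, only single-byte tags and definite lengths are handled (so BER containers with indefinite lengths will raise), and "first match" is only meaningful for a container with a single SignerInfo. A proper ASN.1 or CMS tool remains the more robust choice.

```python
# Full DER encoding (tag, length, value) of OID 1.2.840.113549.1.9.4
MESSAGE_DIGEST_OID = bytes.fromhex("06092a864886f70d010904")


def read_tlv(buf: bytes, pos: int):
    """Return (tag, value_start, value_end) of the definite-length TLV at pos."""
    tag, first = buf[pos], buf[pos + 1]
    pos += 2
    if first < 0x80:
        return tag, pos, pos + first
    count = first & 0x7F
    if count == 0:
        raise ValueError("indefinite length is not handled in this sketch")
    length = int.from_bytes(buf[pos:pos + count], "big")
    return tag, pos + count, pos + count + length


def children(buf: bytes, start: int, end: int):
    """Yield (tag, tlv_start, value_start, value_end) for each TLV in a range."""
    pos = start
    while pos < end:
        tag, value_start, value_end = read_tlv(buf, pos)
        yield tag, pos, value_start, value_end
        pos = value_end


def _search(buf: bytes, start: int, end: int):
    for tag, _, value_start, value_end in children(buf, start, end):
        if tag == 0x30:  # SEQUENCE: could be an Attribute { OID, SET }
            kids = list(children(buf, value_start, value_end))
            if (len(kids) == 2 and kids[1][0] == 0x31
                    and buf[kids[0][1]:kids[0][3]] == MESSAGE_DIGEST_OID):
                first = next(children(buf, kids[1][2], kids[1][3]), None)
                if first is not None and first[0] == 0x04:  # OCTET STRING
                    return buf[first[2]:first[3]]
        if tag & 0x20:  # constructed encoding: descend into it
            found = _search(buf, value_start, value_end)
            if found is not None:
                return found
    return None


def find_message_digest(container: bytes):
    """Return the messageDigest value, or None; trailing padding is ignored."""
    _, _, end = read_tlv(container, 0)
    return _search(container, 0, end)
```

The returned bytes are what has to equal the hash computed over the signed byte ranges. Note that with SignedAttributes present, a matching digest shows only that the attribute agrees with the document; the signature over the attributes still has to be checked separately.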

mkl
  • Superb explanation. Thanks for the detailed analysis. This serves kind of answer to me...Will update once I am actually done with all the analysis. – Barun Sharma Feb 10 '15 at 11:21
  • Read extracted PKCS#7 with openssl: `openssl asn1parse -in signature.bin -inform der` respectively `openssl pkcs7 -in signedHashCert.bin -inform der -text` – hengsti Feb 01 '18 at 09:49