Use Python to determine if PDF was generated by Google Docs

Question

I'd like to use Python to tell if a PDF was created by Google Docs. Is there any sort of metadata I can gather with PyPDF2 to determine this?

You could try this yourself - create a PDF document in Google Docs, download it as PDF and upload it to / view it in a PDF metadata viewer. I tried, and Google Docs uses the same Skia PDF backend that Chrome uses. The `Producer` tag will be `Skia/PDF m83` (m83 is the release, so that will change). This is the same `Producer` tag that Chrome will generate when printed to PDF (m80 currently), _but_ Chrome will set the `Creator` tag to the user agent - this is not set with Google Docs. Start by looking at both files and find the tags that are present in one and not the other. — MatsLindh, Mar 26 '20 at 22:18
A service like https://www.metadata2go.com/ can tell you what metadata is associated with the file - start with that to create a heuristic, then use PyPDF2 and its [DocumentInformation class](https://pythonhosted.org/PyPDF2/DocumentInformation.html) to implement it. — MatsLindh, Mar 26 '20 at 22:19
@MatsLindh: Google Docs leaves Creator blank, which is not really nice of them … I *think* they are unique in that even CreationDate and all other fields are blank too; only Producer is filled in. OP might have to resort to checking this, and the result will have to be "not" or "*possibly* created by Google Docs". — Jongware, Mar 26 '20 at 22:45
Note: it seems `PyPDF2` does not like `/Producer 1557 0 R` (with `1557 0 obj (Mac OS X 10.7.5 Quartz PDFContext)`): "Note that some PDF files use metadata streams instead of docinfo dictionaries, and these metadata streams will not be accessed by this function." — Jongware, Mar 26 '20 at 23:02
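
The comparison suggested in the comments above (find the tags present in one file and not the other) could be sketched roughly as follows. The helper name `diff_metadata` is hypothetical, and the two sample dictionaries are illustrative values based on what the comments report, with the Chrome `Creator` string left as a placeholder. It works on plain dictionaries, so it assumes the metadata has already been read out of each PDF.

```python
def diff_metadata(a, b):
    """Compare two metadata mappings (hypothetical helper, not part of PyPDF2).

    Returns a tuple: (entries only in a, entries only in b,
    keys present in both whose values differ).
    """
    only_in_a = {k: a[k] for k in a.keys() - b.keys()}
    only_in_b = {k: b[k] for k in b.keys() - a.keys()}
    differing = {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}
    return only_in_a, only_in_b, differing


# Illustrative values only: the Producer strings come from the comments above,
# the Creator value is a placeholder for whatever user agent Chrome writes.
docs_info = {'/Producer': 'Skia/PDF m83'}
chrome_info = {'/Producer': 'Skia/PDF m80', '/Creator': 'placeholder user agent'}

only_docs, only_chrome, changed = diff_metadata(docs_info, chrome_info)
print(only_docs, only_chrome, changed)
```

Running this on metadata from a handful of real files is probably the quickest way to see which tags are stable enough to base a heuristic on.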

score 1 · Accepted Answer · answered Mar 27 '20 at 00:26

Calling `pdf.getDocumentInfo()` on a document created by Google Docs returns `{'/Producer': u'Skia/PDF m83'}`. I tested this on a few Google Docs exports, and it seems to check out. It makes sense: Skia is a Google project, so it is presumably what they use to generate documents on their backend.

So you can simply do:

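    import PyPDF2

    # Metadata observed on Google Docs exports; the "m83" release suffix changes over time
    GOOGLE_DOCS_PDF_METADATA = {'/Producer': u'Skia/PDF m83'}

    def file_is_google_doc(pdf_file_path):
        # Exact match: True only if /Producer is the sole entry and equals the value above
        pdf = PyPDF2.PdfFileReader(pdf_file_path)
        return pdf.getDocumentInfo() == GOOGLE_DOCS_PDF_METADATA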
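
Since the comments point out that the release suffix changes and that Chrome writes the same `Producer` prefix but also sets `Creator`, a looser heuristic might look like the sketch below. The function name `looks_like_google_docs` and the constant `SKIA_PRODUCER_PREFIX` are hypothetical, and the rule itself (Skia producer, no `Creator`) is an assumption drawn from the observations in this thread, not a documented guarantee. It takes the metadata as a mapping, so it can be checked without a PDF at hand.

```python
# Assumed prefix, based on the 'Skia/PDF m83' and 'Skia/PDF m80' values reported above
SKIA_PRODUCER_PREFIX = 'Skia/PDF'


def looks_like_google_docs(info):
    """Heuristic: True if the metadata is consistent with a Google Docs export.

    `info` is a dict-like document information mapping, or None.
    A True result means "possibly Google Docs", not a definite answer.
    """
    if not info:
        # No document information dictionary at all
        return False
    producer = info.get('/Producer')
    if not isinstance(producer, str) or not producer.startswith(SKIA_PRODUCER_PREFIX):
        return False
    # Chrome's print-to-PDF reportedly fills in /Creator; Google Docs leaves it unset
    return not info.get('/Creator')


print(looks_like_google_docs({'/Producer': 'Skia/PDF m83'}))
print(looks_like_google_docs({'/Producer': 'Skia/PDF m80', '/Creator': 'placeholder user agent'}))
```

With the reader from the snippet above, this would be called as `looks_like_google_docs(PyPDF2.PdfFileReader(pdf_file_path).getDocumentInfo())`. Other Skia-based tools could still produce matching metadata, so a positive result is best read as "possibly created by Google Docs".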

Arya

As I mentioned in my comment, be aware that this will be used by other Google projects as well - such as Google Chrome. The version number will also change over time (Chrome uses m80). – MatsLindh Mar 27 '20 at 08:55
Thanks! Didn't see your comment. – Arya Mar 27 '20 at 17:00
