I'd like to use Python to tell if a PDF was created by Google Docs. Is there any sort of metadata I can gather with PyPDF2 to determine this?
-
You could try this yourself - create a PDF document in Google Docs, download it as PDF and upload it to / view it in a PDF metadata viewer. I tried, and Google Docs uses the same Skia PDF backend that Chrome uses. The `Producer` tag will be `Skia/PDF m83` (m83 is the release, so that will change). This is the same `Producer` tag that Chrome will generate when printed to PDF (m80 currently), _but_ Chrome will set the `Creator` tag to the user agent - this is not set with Google Docs. Start by looking at both files and find the tags that are present in one and not the other. – MatsLindh Mar 26 '20 at 22:18
-
A service like https://www.metadata2go.com/ can tell you what metadata is associated with the file - start with that to create a heuristic, then use PyPDF2 and the [DocumentInformation class](https://pythonhosted.org/PyPDF2/DocumentInformation.html) to implement it. – MatsLindh Mar 26 '20 at 22:19
-
@MatsLindh: Google Docs leaves Creator blank, which is not really nice of them … I *think* they are unique in that even CreationDate and all other fields are blank too; only Producer is filled in. OP might have to resort to checking this, and the result will have to be "not" or "*possibly* created by Google Docs". – Jongware Mar 26 '20 at 22:45
-
Note: it seems `PyPDF2` does not like `/Producer 1557 0 R` (with `1557 0 obj (Mac OS X 10.7.5 Quartz PDFContext)`): "Note that some PDF files use metadata streams instead of docinfo dictionaries, and these metadata streams will not be accessed by this function." – Jongware Mar 26 '20 at 23:02
-
I did find some metadata! Will post an answer below. – Arya Mar 27 '20 at 00:22
1 Answer
When calling `pdf.getDocumentInfo()` on a document created by Google Docs, it returns `{'/Producer': u'Skia/PDF m83'}`. I tested this on a few Google Docs documents, and it seems to check out. It makes sense - Skia is a Google project, so it is presumably what they use to generate documents on their backend.
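So you can simply do:

    import PyPDF2

    # Metadata observed on PDFs exported from Google Docs (the m83 release suffix may change).
    GOOGLE_DOCS_PDF_METADATA = {'/Producer': u'Skia/PDF m83'}

    def file_is_google_doc(pdf_file_path):
        pdf = PyPDF2.PdfFileReader(pdf_file_path)
        # True only when /Producer is the sole docinfo entry and matches exactly.
        return pdf.getDocumentInfo() == GOOGLE_DOCS_PDF_METADATA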

Arya
-
As I mentioned in my comment, be aware that this will be used by other Google projects as well - such as Google Chrome. The version number will also change over time (Chrome uses m80). – MatsLindh Mar 27 '20 at 08:55
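
Pulling the comments in this thread together, a looser check might match any `Skia/PDF m<number>` producer rather than `m83` specifically, and treat a file as only *possibly* from Google Docs when no other docinfo field is populated (Chrome reportedly fills in `Creator`). The following is a rough sketch of that heuristic, not a definitive detector. The function names `classify_pdf_metadata` and `classify_pdf_file` and the three result labels are made up for illustration, and the wrapper assumes a PyPDF2 release that still provides `PdfFileReader` and `getDocumentInfo()` as used in the answer.

```python
import re

# Matches Producer values such as "Skia/PDF m83"; the release number varies.
SKIA_PRODUCER = re.compile(r"^Skia/PDF m\d+")


def classify_pdf_metadata(info):
    """Return 'not', 'skia-other', or 'possibly-google-docs' (hypothetical labels).

    `info` is a plain dict of docinfo entries, e.g. {'/Producer': 'Skia/PDF m83'}.
    """
    info = info or {}
    producer = str(info.get("/Producer") or "")
    if not SKIA_PRODUCER.match(producer):
        return "not"
    # Assumption from the comments: Google Docs fills in nothing but /Producer,
    # while Chrome's print-to-PDF also sets /Creator.
    other_fields = [
        key for key, value in info.items()
        if key != "/Producer" and value is not None and str(value).strip()
    ]
    if other_fields:
        return "skia-other"
    return "possibly-google-docs"


def classify_pdf_file(path):
    """Read the docinfo dictionary with PyPDF2 and classify it."""
    import PyPDF2  # imported here so the heuristic above runs without PyPDF2

    with open(path, "rb") as handle:
        reader = PyPDF2.PdfFileReader(handle)
        raw = reader.getDocumentInfo()
        # Copy into a plain dict while the file is open so values are resolved.
        info = {str(key): raw[key] for key in raw} if raw else {}
    return classify_pdf_metadata(info)
```

As noted above, PDFs that keep their metadata only in a metadata stream will not be seen by this check, so a `'not'` result is not conclusive either.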