Reading the PDF properties/metadata in Python

Question

How can I read the properties/metadata like Title, Author, Subject and Keywords stored on a PDF file using Python?

score 56 · Accepted Answer · edited Jul 02 '19 at 10:41

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

fp = open('diveintopython.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)

print(doc.info)  # The "Info" metadata

Here's the output:

>>> [{'CreationDate': 'D:20040520151901-0500',
  'Creator': 'DocBook XSL Stylesheets V1.52.2',
  'Keywords': 'Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free',
  'Producer': 'htmldoc 1.8.23 Copyright 1997-2002 Easy Software Products, All Rights Reserved.',
  'Title': 'Dive Into Python'}]

For more info, look at this tutorial: A lightweight XMP parser for extracting PDF metadata in Python.
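
The CreationDate value in the output above is a PDF date string rather than a ready-made timestamp. As a rough sketch, assuming the common full-length layout `D:YYYYMMDDHHmmSS` with an optional UTC offset, it could be converted to a `datetime` using only the standard library. The helper name `parse_pdf_date` is made up for this example and is not part of pdfminer:

```python
import re
from datetime import datetime, timedelta, timezone

# Matches D:YYYYMMDDHHmmSS followed by an optional offset such as
# -0500, -05'00' or Z. Shorter forms (for example year only) are not handled.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})'?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a datetime, or return None if it does not match."""
    match = _PDF_DATE.match(value)
    if match is None:
        return None
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    zulu, sign, off_hours, off_minutes = match.groups()[6:]
    tz = None  # no offset given: leave the datetime naive
    if zulu:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(off_hours), minutes=int(off_minutes))
        tz = timezone(-offset if sign == "-" else offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


print(parse_pdf_date("D:20040520151901-0500"))  # 2004-05-20 15:19:01-05:00
```

If the library hands back the value as bytes, it would need to be decoded to a string before being passed to this helper.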

answered Jan 08 '13 at 06:22 by namit · edited Jul 02 '19 at 10:41 by Guillaume Jacquenot

A heads-up: the author of pdfminer says it's incompatible with Python 3, at least as of the date of this post ([link](https://github.com/euske/pdfminer/)) – JSmyth Jan 05 '14 at 22:36
As of November 2013, the "PDFDocument class now takes a PDFParser object as an argument. PDFDocument.set_parser() and PDFParser.set_document() is removed." So you can just do doc=PDFDocument(parser), and skip the calls to set_document, set_parser, and initialize. – Derek Kurth Oct 14 '14 at 15:55
@JSmyth The [PyPi Index](https://pypi.python.org/pypi?%3Aaction=search&term=pdfminer&submit=search) currently lists three working `pdfminer` forks that are compatible with Python 3. `pip search pdfminer` – zero2cx Jan 19 '17 at 07:10
@zero2cx Thanks for the update. I personally settled on [pdfminer3k](https://github.com/jaepil/pdfminer3k/). It works well for my purposes. One has to read the API document in the repo, though, because the accepted answer here is no longer valid for the pdfminer3k API. – JSmyth Jan 21 '17 at 23:33
There is now an official Python 3 fork of the project: https://github.com/pdfminer/pdfminer.six – Harsh Jun 04 '17 at 12:46
Using `pdfminer.six` on this paper, https://z0ngqing.github.io/paper/nips-jiechuan18.pdf, I got an empty string for the title. – GoingMyWay Nov 19 '18 at 12:11

score 21 · Answer 2 · edited Mar 23 '23 at 07:48

As the maintainer of pypdf I strongly recommend pypdf :-)

from pypdf import PdfReader

reader = PdfReader("test.pdf")

# See what is there:
print(str(reader.metadata))

# Or just access specific values:
print(reader.metadata.creation_date)  # that is actually a datetime object!

Install using pip install pypdf --upgrade.

See also: How to read/write metadata with pypdf

answered Oct 08 '16 at 11:31 by Morten Zilmer · edited Mar 23 '23 at 07:48 by Martin Thoma

I tried @Rabash's answer and it gives me similar results. I think this one is better because it gives better information about the creator. This code's creator output is 'Microsoft...' and Rabash's code gives me some encoded characters. – Heriberto Juarez Aug 06 '19 at 20:22

score 6 · Answer 3 · edited Mar 23 '23 at 07:53

I have implemented this using pypdf. Please see the sample code below. pypdf has been maintained again since December 2022, when the PyPDF2 project was merged back into it.

from pypdf import PdfReader
pdf_toread = PdfReader(open("doc2.pdf", "rb"))
pdf_info = pdf_toread.metadata
print(str(pdf_info))

Output:

{'/Title': u'Microsoft Word - Agnico-Eagle - Complaint (00040197-2)', '/CreationDate': u"D:20111108111228-05'00'", '/Producer': u'Acrobat Distiller 10.0.0 (Windows)', '/ModDate': u"D:20111108112409-05'00'", '/Creator': u'PScript5.dll Version 5.2.2', '/Author': u'LdelPino'}
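
The keys in the output above carry a leading slash, and not every document sets every field. As a small sketch, assuming the metadata behaves like a plain dictionary, the four fields from the question could be collected with `None` standing in for anything missing. The helper name `pick_fields` is hypothetical and not part of pypdf:

```python
# Hypothetical helper, not part of pypdf: it works on any mapping
# shaped like the output shown above.
FIELDS = ("Title", "Author", "Subject", "Keywords")


def pick_fields(info, fields=FIELDS):
    """Return the requested fields without the leading slash; missing ones become None."""
    return {name: info.get("/" + name) for name in fields}


# Trimmed copy of the output above, used as sample input.
sample = {
    "/Title": "Microsoft Word - Agnico-Eagle - Complaint (00040197-2)",
    "/Producer": "Acrobat Distiller 10.0.0 (Windows)",
    "/Author": "LdelPino",
}
print(pick_fields(sample))
```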

score 6 · Answer 4 · answered Dec 19 '16 at 01:36

For Python 3 and new pdfminer (pip install pdfminer3k):

from pdfminer.pdfparser import PDFParser, PDFDocument

fp = open("foo.pdf", 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
parser.set_document(doc)
doc.set_parser(parser)
if len(doc.info) > 0:
    info = doc.info[0]
    print(info)

score 6 · Answer 5 · edited Feb 06 '23 at 22:29

pikepdf provides an easy and reliable way to do this.

I tested this with a bunch of PDF files, and it seems metadata is stored in two distinct ways when the PDF is created: some values are plain text, while others contain NUL bytes and other gibberish. Pikepdf handles both well.

import pikepdf
p = pikepdf.Pdf.open(r'path/to/file.pdf')
str(p.docinfo['/Author'])  # mind the slash

This returns a string - if you wrapped it with str. Examples:

'Normal person'
'ABC'

Comparing with other options:

pdfminer - not actively maintained
pdfminer.six - active
pdfreader - active (but its documentation still suggests using easy_install, among other things)
pypdf - active
PyPDF2 - merged back into pypdf; PyPDF2==3.0.0 and pypdf==3.1.0 are essentially the same, but development continues in pypdf
Borb - active

Pdfminer.six:

pip install pdfminer.six

import pdfminer.pdfparser
import pdfminer.pdfdocument
h = open('path/to/file.pdf', 'rb')
p = pdfminer.pdfparser.PDFParser(h)
d = pdfminer.pdfdocument.PDFDocument(p)  # PDFDocument lives in pdfminer.pdfdocument
d.info[0]['Author']

This returns a bytes object, including any non-decodable characters that are present. Examples:

b'Normal person'
b'\xfe\xff\x00A\x00B\x00C' (ABC)

To convert to a string:

b'Normal person'.decode() yields the string 'Normal person'
b'\xfe\xff\x00A\x00B\x00C'.decode(encoding='utf-8', errors='ignore').replace('\x00', '') yields the string 'ABC'
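
The utf-8 workaround above only survives for ASCII text, since any non-ASCII character is dropped along with the NUL bytes. The leading `\xfe\xff` is the UTF-16 big-endian byte order mark, so a less lossy sketch could branch on it. The helper name `decode_pdf_text` is made up here, and the Latin-1 fallback is an assumption that only approximates the encoding PDF uses for strings without a byte order mark:

```python
import codecs


def decode_pdf_text(raw):
    """Decode a metadata value given as bytes.

    Values starting with the UTF-16 big-endian byte order mark are decoded
    as UTF-16. Everything else is treated as Latin-1 here, which is an
    assumption and only an approximation.
    """
    if raw.startswith(codecs.BOM_UTF16_BE):
        body = raw[len(codecs.BOM_UTF16_BE):]
        return body.decode("utf-16-be", errors="replace")
    return raw.decode("latin-1")


print(decode_pdf_text(b"Normal person"))             # Normal person
print(decode_pdf_text(b"\xfe\xff\x00A\x00B\x00C"))   # ABC
```

With this approach an accented name keeps its accents instead of losing them.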

pdfreader

pip install pdfreader

import pdfreader
h = open(r'path/to/file.pdf', 'rb')
d = pdfreader.PDFDocument(h)
d.metadata['Author']

This returns either the string with the requested information, or a string containing the hex representation of the data it found. This then also includes the same non-decodable characters. Examples:

'Normal person'
'FEFF004100420043' (ABC)

You would then first need to detect whether this is still 'encoded', which I think is quite a nuisance. The second can be made a sensible string by calling this ugly piece of code:

s = 'FEFF004100420043'
''.join([c for c in (chr(int(s[i:i+2], 16)) for i in range(0, len(s), 2)) if c.isascii()]).replace('\x00', '')
>>> 'ABC'
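
A shorter alternative to the one-liner above, which also keeps non-ASCII characters, might convert the hex digits back to bytes and decode them as UTF-16. This is only a sketch: the helper name `decode_hex_text` is hypothetical, and treating every value that starts with `FEFF` as hex-encoded is a heuristic assumption, not documented pdfreader behaviour:

```python
def decode_hex_text(value):
    """Decode a hex-encoded UTF-16 value; return anything else unchanged."""
    # Heuristic: FEFF is the hex form of the UTF-16 byte order mark.
    if not value.upper().startswith("FEFF"):
        return value
    try:
        # The utf-16 codec reads the byte order mark and strips it.
        return bytes.fromhex(value).decode("utf-16")
    except (ValueError, UnicodeDecodeError):
        return value  # not valid hex, or not valid UTF-16 after all


print(decode_hex_text("FEFF004100420043"))  # ABC
print(decode_hex_text("Normal person"))     # Normal person
```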

Borb

pip install borb

import borb.pdf.pdf
h = open(r'path/to/file.pdf', 'rb')
d: borb.pdf.document.Document = borb.pdf.pdf.PDF.loads(h)
str(d.get_document_info().get_author())

This returns a string - if you wrapped it with str. Loading a sizeable PDF takes a long time. I had one PDF on which borb choked with a TypeError exception. See also the examples on borb's dedicated example repo.

score 0 · Answer 6 · answered Nov 26 '19 at 21:40

Try pdfreader. You can access the document catalog's Metadata entry like this:

from pdfreader import PDFDocument

f = open("foo.pdf", 'rb')
doc = PDFDocument(f)
metadata = doc.root.Metadata
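
The catalog's Metadata entry refers to an XMP packet, which is XML rather than a simple key/value dictionary. How pdfreader exposes the packet's text is not shown here, so the sketch below starts from an XML string. The sample packet and the helper name `xmp_values` are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Namespaces used by XMP packets for RDF and Dublin Core.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up packet for illustration; a real one would come from the PDF.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Dive Into Python</rdf:li></rdf:Alt></dc:title>
      <dc:creator><rdf:Seq><rdf:li>Example Author</rdf:li></rdf:Seq></dc:creator>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def xmp_values(xmp_xml, tag):
    """Return the text of every rdf:li item found under the given Dublin Core tag."""
    root = ET.fromstring(xmp_xml)
    return [li.text for li in root.findall(f".//dc:{tag}//rdf:li", NS)]


print(xmp_values(SAMPLE_XMP, "title"))    # ['Dive Into Python']
print(xmp_values(SAMPLE_XMP, "creator"))  # ['Example Author']
```

In XMP the title is stored under `dc:title` and the author under `dc:creator`, which is why the lookup uses the Dublin Core namespace.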

answered Nov 26 '19 at 21:40 by Maksym Polshcha

Thank you! Can you please explain why the Python market of PDF libraries needed another solution? Which shortcomings does it address? Cheers! – Ciprian Tomoiagă Dec 04 '19 at 15:00
@CiprianTomoiagă As for me, the best tool at the moment is *pdfminer*, but it is very slow on big documents and not always good for parsing text data. – Maksym Polshcha Dec 05 '19 at 20:20
