51

How can I read the properties/metadata like Title, Author, Subject and Keywords stored on a PDF file using Python?

wolφi
Quicksilver

6 Answers

56

Try pdfminer:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

fp = open('diveintopython.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)

print(doc.info)  # The "Info" metadata

Here's the output:

>>> [{'CreationDate': 'D:20040520151901-0500',
  'Creator': 'DocBook XSL Stylesheets V1.52.2',
  'Keywords': 'Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free',
  'Producer': 'htmldoc 1.8.23 Copyright 1997-2002 Easy Software Products, All Rights Reserved.',
  'Title': 'Dive Into Python'}]
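
The CreationDate value in this output is a PDF date string, not a ready-made timestamp. As a minimal sketch using only the standard library, it could be turned into a datetime like this. The helper name parse_pdf_date is hypothetical, and the code assumes the D:YYYYMMDDHHmmSS layout with an optional UTC offset, with or without apostrophes:

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout: D:YYYY[MM[DD[HH[mm[SS]]]]] plus an optional offset such as
# -0500, -05'00' or Z. Every field after the year is optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(\d{2})?'?(\d{2})?'?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Convert a PDF date string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()

    tz = None  # stays naive when the string carries no offset
    if sign in ("Z", "z"):
        tz = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
        tz = timezone(-offset if sign == "-" else offset)

    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=tz,
    )


print(parse_pdf_date("D:20040520151901-0500").isoformat())
# → 2004-05-20T15:19:01-05:00
```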

For more info, look at this tutorial: A lightweight XMP parser for extracting PDF metadata in Python.

Guillaume Jacquenot
namit
  • A heads-up: the author of pdfminer says it's incompatible with Python 3, at least as of the date of this post ([link](https://github.com/euske/pdfminer/)) – JSmyth Jan 05 '14 at 22:36
  • As of November 2013, the "PDFDocument class now takes a PDFParser object as an argument. PDFDocument.set_parser() and PDFParser.set_document() is removed." So you can just do doc=PDFDocument(parser), and skip the calls to set_document, set_parser, and initialize. – Derek Kurth Oct 14 '14 at 15:55
  • @JSmyth The [PyPi Index](https://pypi.python.org/pypi?%3Aaction=search&term=pdfminer&submit=search) currently lists three working `pdfminer` forks that are compatible with Python 3. `pip search pdfminer` – zero2cx Jan 19 '17 at 07:10
  • @zero2cx Thanks for the update. I personally settled on [pdfminer3k](https://github.com/jaepil/pdfminer3k/). It works well for my purposes. One has to read the API document in the repo, though, as the answer accepted here is no longer valid for the pdfminer3k API. – JSmyth Jan 21 '17 at 23:33
  • There is now an official Python 3 fork of the project: https://github.com/pdfminer/pdfminer.six – Harsh Jun 04 '17 at 12:46
  • Using `pdfminer.six` on this paper, https://z0ngqing.github.io/paper/nips-jiechuan18.pdf, I got an empty string for the title. – GoingMyWay Nov 19 '18 at 12:11
21

As the maintainer of pypdf I strongly recommend pypdf :-)

from pypdf import PdfReader

reader = PdfReader("test.pdf")

# See what is there:
print(str(reader.metadata))

# Or just access specific values:
print(reader.metadata.creation_date)  # that is actually a datetime object!

Install using pip install pypdf --upgrade.

See also: How to read/write metadata with pypdf

Martin Thoma
Morten Zilmer
  • I tried @Rabash's answer and it gives me similar results. I think this one is better because it gives better information about the creator. This code's creator output is 'Microsoft...', while Rabash's code gives me some encoded characters. – Heriberto Juarez Aug 06 '19 at 20:22
6

I have implemented this using pypdf; see the sample code below. pypdf has been maintained again since December 2022, when the PyPDF2 project was merged back into it.

from pypdf import PdfReader
pdf_toread = PdfReader("doc2.pdf")  # PdfReader accepts a path, so no open() is needed
pdf_info = pdf_toread.metadata
print(str(pdf_info))

Output:

{'/Title': u'Microsoft Word - Agnico-Eagle - Complaint (00040197-2)', '/CreationDate': u"D:20111108111228-05'00'", '/Producer': u'Acrobat Distiller 10.0.0 (Windows)', '/ModDate': u"D:20111108112409-05'00'", '/Creator': u'PScript5.dll Version 5.2.2', '/Author': u'LdelPino'}
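
The question asks for Title, Author, Subject and Keywords specifically. As a rough sketch, those four could be pulled out of a dict-like object such as the one printed above. The helper name pick_fields is hypothetical, and passing every value through str() is an assumption that suits plain text values:

```python
from typing import Mapping, Optional

WANTED = ("Title", "Author", "Subject", "Keywords")


def pick_fields(info: Optional[Mapping]) -> dict:
    """Return the four requested fields; absent ones map to None."""
    if info is None:  # a PDF may carry no metadata at all
        return {name: None for name in WANTED}
    result = {}
    for name in WANTED:
        # Keys are slash-prefixed in the output above; plain names are
        # tried as a fallback for libraries that drop the slash.
        value = info.get("/" + name, info.get(name))
        result[name] = None if value is None else str(value)
    return result


sample = {
    "/Title": "Microsoft Word - Agnico-Eagle - Complaint (00040197-2)",
    "/Author": "LdelPino",
    "/Producer": "Acrobat Distiller 10.0.0 (Windows)",
}
print(pick_fields(sample))
```

With the reader from the snippet above, the call would be pick_fields(pdf_toread.metadata). Fields the PDF does not define, such as Subject and Keywords in this sample, come back as None.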
Martin Thoma
Quicksilver
6

For Python 3 and new pdfminer (pip install pdfminer3k):

from pdfminer.pdfparser import PDFParser, PDFDocument

fp = open("foo.pdf", 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
parser.set_document(doc)
doc.set_parser(parser)
if len(doc.info) > 0:
    info = doc.info[0]
    print(info)
Rabash
6

pikepdf provides an easy and reliable way to do this.

I tested this with a bunch of PDF files, and it seems there are two distinct ways metadata gets written when the PDF is created: some files store plain text, while others store values that show up with NUL bytes and other apparent gibberish (such as a leading \xfe\xff). pikepdf handles both well.

import pikepdf
p = pikepdf.Pdf.open(r'path/to/file.pdf')
str(p.docinfo['/Author'])  # mind the slash

This returns a string, provided you wrap the value in str(). Examples:

  • 'Normal person'
  • 'ABC'

Comparing with other options:

  • pdfminer - not actively maintained
  • pdfminer.six - active
  • pdfreader - active (but its documentation still suggests installing with easy_install, among other things)
  • pypdf - active
  • PyPDF2 - merged back into pypdf. PyPDF2==3.0.0 and pypdf==3.1.0 are essentially the same, but development continues in pypdf
  • Borb - active

Pdfminer.six:

pip install pdfminer.six

import pdfminer.pdfparser
import pdfminer.pdfdocument
h = open('path/to/file.pdf', 'rb')
p = pdfminer.pdfparser.PDFParser(h)
d = pdfminer.pdfdocument.PDFDocument(p)  # PDFDocument is defined in pdfminer.pdfdocument
d.info[0]['Author']

This returns a binary string, including the non-decodable characters if they are present. Examples:

  • b'Normal person'
  • b'\xfe\xff\x00A\x00B\x00C' (ABC)

To convert to a string:

  • b'Normal person'.decode() yields the string 'Normal person'
  • b'\xfe\xff\x00A\x00B\x00C'.decode(encoding='utf-8', errors='ignore').replace('\x00', '') yields the string 'ABC'
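
The \xfe\xff prefix in the second example is a UTF-16 big-endian byte order mark, so these bytes can be decoded directly instead of stripping NUL characters, which also keeps non-ASCII characters intact. A minimal sketch follows. The helper name decode_info_bytes is hypothetical, and treating every other value as Latin-1 is an assumption that only approximates the PDF default encoding:

```python
def decode_info_bytes(raw: bytes) -> str:
    """Decode a raw Info-dictionary value into a Python string."""
    if raw.startswith(b"\xfe\xff"):
        # UTF-16 big-endian with a byte order mark: drop the mark, then decode.
        return raw[2:].decode("utf-16-be", errors="replace")
    # Assumption: everything else is treated as Latin-1, which never raises
    # on arbitrary bytes but is only an approximation of PDFDocEncoding.
    return raw.decode("latin-1")


print(decode_info_bytes(b"Normal person"))             # → Normal person
print(decode_info_bytes(b"\xfe\xff\x00A\x00B\x00C"))   # → ABC
```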

pdfreader

pip install pdfreader

import pdfreader
h = open(r'path/to/file.pdf', 'rb')
d = pdfreader.PDFDocument(h)
d.metadata['Author']

This returns either the string with the requested information, or a string containing the hex representation of the data it found. This then also includes the same non-decodable characters. Examples:

  • 'Normal person'
  • 'FEFF004100420043' (ABC)

You would then first need to detect whether this is still 'encoded', which I think is quite a nuisance. The second can be made a sensible string by calling this ugly piece of code:

s = 'FEFF004100420043'
''.join([c for c in (chr(int(s[i:i+2], 16)) for i in range(0, len(s), 2)) if c.isascii()]).replace('\x00', '')
>>> 'ABC'
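
A somewhat tidier variant of the same idea, as a sketch only: detect the hex form by its FEFF prefix and decode the bytes as UTF-16 big-endian. The helper name decode_hex_field is hypothetical, and the detection is a heuristic that assumes no genuine value consists solely of hex digits starting with FEFF:

```python
import string


def decode_hex_field(value: str) -> str:
    """Return readable text for a value that may be a hex dump of UTF-16 data."""
    looks_hex = (
        len(value) % 2 == 0
        and value.upper().startswith("FEFF")
        and all(ch in string.hexdigits for ch in value)
    )
    if not looks_hex:
        return value  # already an ordinary string, leave it alone
    raw = bytes.fromhex(value)
    # Skip the two-byte byte order mark, then decode the rest.
    return raw[2:].decode("utf-16-be", errors="replace")


print(decode_hex_field("Normal person"))     # → Normal person
print(decode_hex_field("FEFF004100420043"))  # → ABC
```

Unlike the one-liner above, this does not discard non-ASCII characters.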

Borb

pip install borb

import borb.pdf.pdf
h = open(r'path/to/file.pdf', 'rb')
d: borb.pdf.document.Document = borb.pdf.pdf.PDF.loads(h)
str(d.get_document_info().get_author())

This returns a string, provided you wrap the result in str(). Loading a sizeable PDF takes a long time, and on one PDF borb choked with a TypeError exception. See also the examples in borb's dedicated example repo.

Martin Thoma
parvus
0

Try pdfreader. You can access the document catalog's Metadata entry like this:

from pdfreader import PDFDocument
f = open("foo.pdf", 'rb')
doc = PDFDocument(f)
metadata = doc.root.Metadata
Maksym Polshcha
  • Thank you! Can you please detail why the Python market of PDF libraries needed another solution? Which shortcomings does it address? Cheers! – Ciprian Tomoiagă Dec 04 '19 at 15:00
  • @CiprianTomoiagă As for me, the best tool at the moment is *pdfminer*, but it is very slow on big documents and not always good for parsing text data. – Maksym Polshcha Dec 05 '19 at 20:20