I need a simple Python library to convert a PDF to images (render the PDF as is). After hours of searching, I keep hitting the same wall: the libraries I find, such as pdf2image (and many similar ones), depend on external applications or wrap command-line tools.

Although there are workarounds that allow using these libraries in serverless settings, they would all complicate our deployment and require things like Execution Environments or extra Lambda layers, which eat into the small allowed Lambda size.

Is there a self-contained mechanism (one that does not depend on command-line tools) for this seemingly simple task?

Also, is there a reason (licensing or patents) why tools that deal with PDFs are so scarce? They are mostly commercial or under strict AGPL licenses.

Saw
  • https://en.wikipedia.org/wiki/List_of_PDF_software is a good resource – Saw Aug 27 '21 at 11:15
  • QPDF seems to be a reasonable library (there is an established Pythonic wrapper not based on command line crap), but I can't find any way to convert/render a PDF to images :( – Saw Aug 27 '21 at 11:21
  • Oh, I didn't know they are based on Ghostscript. This is brutal; they all seem to use the same stuff in the background. How could they offer it as Apache then? Really strange. – Saw Aug 27 '21 at 11:22

2 Answers

You said "Ended up using pdf2image"

pdf2image (MIT) is a Python (3.6+) module that wraps pdftoppm (GPL?) and pdftocairo (GPL?) to convert a PDF to a PIL Image object.

In general, Poppler (GPL) is a spin-off of the open-source Xpdf (GPL), which has

  • pdftopng
  • pdftoppm
  • pdfimages

and a third-party pdftotiff.

K J
  • We ended up not using this because of licensing limitations, and are using a SaaS service until we figure out a library. A bit annoying tbh! – Saw Sep 01 '21 at 18:31
  • The use case is so small that it is not worth losing sleep over for us. – Saw Sep 01 '21 at 18:32

You can convert PDFs to images without external dependencies using PyMuPDF. I use it for Azure Functions.

Install with pip install PyMuPDF

In your Python file:

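    import fitz  # the import name of PyMuPDF

    pdfDoc = fitz.open(filepath)  # filepath: path to the PDF on disk
    img = pdfDoc[0].get_pixmap(matrix=fitz.Matrix(2, 2))  # render the first page
    bytesimg = img.tobytes()  # encoded image bytes (PNG by default)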

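This takes the first page of the PDF and renders it to an image. The matrix sets the resolution: fitz.Matrix(2, 2) scales the page by a factor of 2 in each direction, i.e. 144 DPI instead of the default 72.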

You can also open a stream instead of a file on disk:

    pdfDoc = fitz.open(stream=pdfstream, filetype="pdf")
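
If you need every page rather than just the first, and the PDF arrives as bytes (as it typically would in a serverless function), something along these lines might work. This is only a rough sketch: zoom_for_dpi and render_pdf_bytes are names I made up for illustration, not part of PyMuPDF, and the default of 144 DPI is an arbitrary assumption.

```python
from typing import List


def zoom_for_dpi(dpi: float) -> float:
    """Hypothetical helper: scale factor for fitz.Matrix.

    PDF page units are points, 72 per inch, so 72 DPI means no scaling.
    """
    return dpi / 72.0


def render_pdf_bytes(pdf_bytes: bytes, dpi: float = 144) -> List[bytes]:
    """Hypothetical helper: render each page of an in-memory PDF to PNG bytes."""
    # Imported inside the function so that zoom_for_dpi stays usable
    # even where PyMuPDF is not installed.
    import fitz  # PyMuPDF

    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # same zoom horizontally and vertically
    images = []
    # Open from memory instead of from a path on disk.
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=matrix, alpha=False)
            images.append(pix.tobytes("png"))  # one PNG per page
    return images
```

With dpi=144 the zoom works out to 2, so the pages should come out at the same scale as with fitz.Matrix(2, 2) above.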
Jacob-Jan Mosselman