
I want to read the properties (width, height and DPI) of an image embedded in a single-page PDF file. I'm using PyMuPDF:

import fitz
pdf_file = fitz.open(filepath)
for page in pdf_file:
    images = page.get_images() # returns an empty list [] :(
    contents = page.get_contents() # returns a list with one xref: [10]
    pdf_file.xref_is_stream(10) # trying this I got a True, so the image in PDF are stored as a stream
    stream = pdf_file.xref_stream(10) # so I extracted the stream

When I open the PDF file, I can see the image in it. The first characters of the stream are:

1.00000 0.00000 0.00000 1.00000 0.0000 0.0000 cm\r\n/GS11 gs\r\n/OC /Pr12 BDC\r\nq\r\nq\r\nq\r\n/GS13 gs\r\n/CS14 cs 0.0000 0.0000 0.0000 1.0000 scn\r

I know PIL uses the first bytes of a file to identify the image format. To read the stream as an image, I tried:

import io
from PIL import Image

img = Image.open(stream) # *** ValueError: embedded null byte

img_stream = io.BytesIO(stream)
img_stream.seek(0)
img = Image.open(img_stream) # *** PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7fc0d1ff4810>

I cannot use Image.frombytes since I don't know the dimensions of the image. Those dimensions are exactly what I'm trying to get.

This PDF contains an image that will be used to create a film matrix for the packaging production industry. Image size is crucial for them to calculate how much film to use. The client said the image is a high quality TIFF embedded in the PDF.

Any idea how to convert this stream into an image so I can read this information?

1 Answer


In PyMuPDF you have full access to an embedded image. If page.get_images() returns an empty list, then the page has no images that can be reached via an xref! There are, however, several reasons why you may still see something that looks like an image:

  1. It is not an image, but a drawing (synonyms: line art, vector graphics).
  2. It is an "embedded" image: this kind is only known to the page. They are (or should be) typically small.

The definitive check for whether a page truly has images is page.get_image_info(), which returns a list of all images, whether xref-based or embedded. If this list is empty, then the page in fact has no images, regardless of your visual impression.
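As a minimal sketch of how that list could answer the original question: each entry returned by page.get_image_info() is a dict that includes the pixel "width" and "height" and the "bbox" where the image is displayed (in points, 1/72 inch). The helper names effective_dpi and report_images are hypothetical, and the calculation assumes an unrotated image, because for a rotated image the bbox is only the enclosing rectangle.

```python
def effective_dpi(info: dict) -> tuple[float, float]:
    """Pixels per inch at which an image is shown on the page (hypothetical helper)."""
    x0, y0, x1, y1 = info["bbox"]  # display rectangle in points (1/72 inch)
    return (
        info["width"] / ((x1 - x0) / 72),
        info["height"] / ((y1 - y0) / 72),
    )


def report_images(filepath: str) -> None:
    """Print pixel size and effective DPI of every image on every page."""
    import fitz  # PyMuPDF; imported here so effective_dpi() is usable without it

    with fitz.open(filepath) as doc:
        for page in doc:
            for info in page.get_image_info():
                dpi_x, dpi_y = effective_dpi(info)
                print(page.number, info["width"], info["height"],
                      round(dpi_x), round(dpi_y))


# Hand-written example: a 600 x 300 pixel image displayed at 2 x 1 inch.
example = {"width": 600, "height": 300, "bbox": (0, 0, 144, 72)}
print(effective_dpi(example))
```

If report_images() prints nothing for a file, the page has no images at all, which matches the situation described in the question.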

Interestingly, all page images can also be extracted via page.get_text("dict")["blocks"] if you filter this list for image blocks. Each image block gives you the image metadata as well as the binary image stream.
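A rough sketch of that filtering, assuming the usual layout of the "dict" output in which image blocks carry "type" equal to 1 together with "width", "height", "ext" and the raw bytes under "image". The function names image_blocks and dump_page_images and the output file names are made up for illustration.

```python
def image_blocks(blocks: list) -> list:
    """Keep only the image blocks (type 1) of page.get_text("dict")["blocks"]."""
    return [b for b in blocks if b.get("type") == 1]


def dump_page_images(filepath: str) -> None:
    """Print the metadata of each image block and write its bytes to a file."""
    import fitz  # PyMuPDF; imported here so image_blocks() is usable without it

    with fitz.open(filepath) as doc:
        for page in doc:
            blocks = page.get_text("dict")["blocks"]
            for i, block in enumerate(image_blocks(blocks)):
                print(page.number, block["width"], block["height"], block["ext"])
                # hypothetical file name pattern
                name = f"page-{page.number}-img-{i}.{block['ext']}"
                with open(name, "wb") as out:
                    out.write(block["image"])


# Hand-written example: one text block and one image block.
sample_blocks = [
    {"type": 0, "lines": []},
    {"type": 1, "width": 10, "height": 20, "ext": "png", "image": b""},
]
print(len(image_blocks(sample_blocks)))
```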

What you tried to do is read the page's /Contents object. It is in fact a stream, but its content has nothing to do with the page's images: it holds the drawing instructions for the page.
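To make this concrete, here is a very rough sketch that tallies the operators in the stream excerpt quoted in the question. The helper count_operators is hypothetical and its tokenizer is deliberately naive (it ignores string operands and inline image data), so treat the result as a hint only. In a content stream, an image XObject is painted with the "Do" operator and an inline image starts with "BI"; the excerpt shows only graphics state and color operators.

```python
from collections import Counter


def count_operators(stream: bytes) -> Counter:
    """Naive tally of the operators in a PDF content stream (hypothetical helper)."""
    ops = Counter()
    for token in stream.decode("latin-1").split():
        if token.startswith("/"):
            continue  # a name operand such as /GS11
        try:
            float(token)  # a numeric operand
        except ValueError:
            ops[token] += 1  # everything else is treated as an operator
    return ops


# The first characters of the stream, copied from the question.
sample = (
    b"1.00000 0.00000 0.00000 1.00000 0.0000 0.0000 cm\r\n/GS11 gs\r\n"
    b"/OC /Pr12 BDC\r\nq\r\nq\r\nq\r\n/GS13 gs\r\n"
    b"/CS14 cs 0.0000 0.0000 0.0000 1.0000 scn\r"
)
ops = count_operators(sample)
print(dict(ops))
print("paints an image XObject:", "Do" in ops)
print("starts an inline image:", "BI" in ops)
```

Since the excerpt is only the beginning of the stream, the absence of "Do" and "BI" here is not conclusive on its own; passing the full stream from pdf_file.xref_stream(10) would give a better picture.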


Here is how to output (render) a document page to images:

import fitz
doc = fitz.open(filepath)
for page in doc:
    pix = page.get_pixmap(dpi=150)  # render page to an internal image format
    # now output as desired image file:
    pix.save(f"page-{page.number}.png")  # PNG file
    # or using Pillow, e.g. for TIFF:
    pix.pil_save(f"page-{page.number}.tiff", format="TIFF")  # accepts any arguments for saving Pillow Images
Jorj McKie
  • You are right! Thank you very much! I opened the PDF in LibreOffice Draw and indeed it has vectors. I hadn't thought of that because the client was very sure that the PDF contains a TIFF! :D Can I convert this to TIFF using PyMuPDF? If so, can you edit your answer with a link or some direction to where I can find a tutorial? I'll mark your answer as accepted. – Márcio Duarte Feb 16 '23 at 22:21
  • PyMuPDF supports **rendering** a document page, i.e. creating an image that looks like what you see in a PDF viewer. Its genuine image **output** formats are PNG, JPEG, PPM, PS, PSD - not TIFF or GIF (although a round dozen input image formats are accepted, including TIFF). We have an elegant interface to Pillow, however, which almost seamlessly allows output in all formats Pillow is capable of. But I am not quite sure if rendering the page as an image is what you want. Nevertheless, have a look at my answer. – Jorj McKie Feb 17 '23 at 23:23