I want to read the properties (width, height and DPI) of an image embedded in a single-page PDF file. I'm using PyMuPDF:
import fitz
pdf_file = fitz.open(filepath)
for page in pdf_file:
    images = page.get_images()  # returns an empty list [] :(
    contents = page.get_contents()  # returns a list with one xref: [10]
pdf_file.xref_is_stream(10)  # returns True, so the image in the PDF seems to be stored as a stream
stream = pdf_file.xref_stream(10)  # so I extracted the stream (as bytes)
When I open the PDF file in a viewer, I can see the image in it. The first characters in the stream are:
1.00000 0.00000 0.00000 1.00000 0.0000 0.0000 cm\r\n/GS11 gs\r\n/OC /Pr12 BDC\r\nq\r\nq\r\nq\r\n/GS13 gs\r\n/CS14 cs 0.0000 0.0000 0.0000 1.0000 scn\r
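For context, tokens such as `cm`, `gs`, `BDC`, `q`, `cs` and `scn` are PDF content-stream operators (drawing instructions), not image bytes. In a content stream, an image or Form XObject is painted with `/Name Do`. Below is a minimal sketch, using only the standard library, that lists the names invoked with `Do`. The helper name `find_xobject_names` and the sample bytes are made up for illustration, and the regex is naive: it does not skip string literals or inline images.

```python
import re

# Matches "/Name Do", the PDF operator that paints a named XObject
# (an image or a Form XObject) onto the page.
DO_PATTERN = re.compile(rb"/([^\s/\[\]()<>{}%]+)\s+Do\b")


def find_xobject_names(content_stream: bytes) -> list:
    """Return the names of all XObjects invoked with `Do` in a content stream."""
    return [m.group(1).decode("latin-1") for m in DO_PATTERN.finditer(content_stream)]


# Hypothetical sample, not taken from the real file.
sample = b"q\r\n100 0 0 50 10 10 cm\r\n/Im0 Do\r\nQ\r\n/Fm1 Do\r\n"
print(find_xobject_names(sample))  # ['Im0', 'Fm1']
```

If a name found this way turns out to be a Form XObject rather than an image, the image may be nested one level deeper, which could be one reason a page-level image listing comes back empty.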
I know PIL uses the first bytes of a file to identify the image format. To try to read this stream as an image, I did:
import io
from PIL import Image
img = Image.open(stream)  # *** ValueError: embedded null byte
img_stream = io.BytesIO(stream)
img_stream.seek(0)
img = Image.open(img_stream)  # *** PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7fc0d1ff4810>
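To show what I mean by identifying the format from the first bytes, here is a rough sketch of a signature check. It is not Pillow's actual code, only an illustration with a few well-known signatures; `sniff_image_format` is a hypothetical helper name.

```python
from typing import Optional

# Well-known file signatures ("magic numbers") for a few image formats.
MAGIC_NUMBERS = {
    b"II*\x00": "TIFF (little-endian)",
    b"MM\x00*": "TIFF (big-endian)",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"\xff\xd8\xff": "JPEG",
}


def sniff_image_format(data: bytes) -> Optional[str]:
    """Return a format name if `data` starts with a known signature, else None."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return None


print(sniff_image_format(b"II*\x00\x08\x00\x00\x00"))  # TIFF (little-endian)
# The stream from the PDF starts with plain text, so nothing matches:
print(sniff_image_format(b"1.00000 0.00000 0.00000 1.00000 0.0000 0.0000 cm\r\n"))  # None
```

Since my stream starts with text like `1.00000 0.00000 ... cm`, it matches none of these signatures, which is consistent with the `UnidentifiedImageError` above.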
I cannot use Image.frombytes since I don't know the dimensions of the image; that is exactly the information I'm trying to get.
This PDF contains an image that will be used to create a film matrix for the packaging production industry. The image size is crucial for them to calculate how much film to use. The client said the image is a high-quality TIFF embedded in the PDF.
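To be explicit about the DPI I am after: once the pixel dimensions and the size at which the image is placed on the page are known, the effective resolution follows from the default PDF unit of 1 point = 1/72 inch. A small sketch of that arithmetic, where `effective_dpi` is a hypothetical helper and the numbers are invented for illustration:

```python
POINTS_PER_INCH = 72  # default PDF user-space unit: 1 point = 1/72 inch


def effective_dpi(pixel_width, pixel_height, placed_width_pt, placed_height_pt):
    """Return (horizontal, vertical) DPI of an image placed on a PDF page.

    pixel_*  : intrinsic image size in pixels
    placed_* : size of the area the image occupies on the page, in points
    """
    width_in = placed_width_pt / POINTS_PER_INCH
    height_in = placed_height_pt / POINTS_PER_INCH
    return pixel_width / width_in, pixel_height / height_in


# Invented example: a 3000 x 1500 px image placed at 720 x 360 pt (10 x 5 inches).
print(effective_dpi(3000, 1500, 720, 360))  # (300.0, 300.0)
```

This assumes the page uses the default user unit and that the placed size already accounts for any scaling applied on the page.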
Any idea how to convert this stream into an image so that I can read this information?