The program uses the PyMuPDF package to extract images from a PDF file:
- Open the PDF file with a context manager: with fitz.Document(file) as doc.
- Get the set of image XREFs with a set comprehension. More details are in the notes.
- Loop over the set of XREFs.
- In every iteration:
  - Extract the dictionary describing one image with image_dict = doc.extract_image(xref). An alternative is to create a Pixmap.
  - The key "image" now holds the image data, usable as the content of an image file, and the key "bpc" holds the number of bits per component.
  - The number of bits per component is used to work out the element type of the NumPy array as np.dtype(f'u{image_dict["bpc"] // 8}'), where f'u{image_dict["bpc"] // 8}' yields an array-protocol type string such as 'u1' or 'u2', that is, the NumPy data type for a one-byte or two-byte unsigned integer. See the notes.
  - Create a NumPy array with the NumPy function frombuffer, using the type described above.
  - The OpenCV function imdecode converts the array into an OpenCV Mat.
  - Show the Mat on the screen with OpenCV's imshow().
  - Go to the next image.
import fitz
import numpy as np
import cv2

# file path you want to extract images from
file = "test.pdf"

with fitz.Document(file) as doc:
    for xref in {xref[0] for page in doc for xref in page.get_images(False) if xref[1] == 0}:
        # dictionary with image
        image_dict = doc.extract_image(xref)
        # image as OpenCV's Mat
        i = cv2.imdecode(np.frombuffer(image_dict["image"],
                                       np.dtype(f'u{image_dict["bpc"] // 8}')
                                       ),
                         cv2.IMREAD_GRAYSCALE)
        cv2.imshow("OpenCV", i)
        cv2.waitKey(0)
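
To make the set comprehension in the program above easier to follow, here is a small sketch of the same logic unrolled into explicit loops. It runs on made-up sample data rather than a real PDF: fake_doc is a hypothetical stand-in for the document, with each page reduced to a list of two-item tuples (the image XREF and the XREF of its mask, which are the first two fields of the entries returned by page.get_images()). The helper name unique_xrefs is invented for illustration only.

```python
# Hypothetical stand-in for a document: each "page" is a list of
# (xref, smask) tuples, i.e. the first two fields of the entries
# that page.get_images() returns in PyMuPDF.
fake_doc = [
    [(10, 0), (11, 25)],   # page 1: image 11 has a mask (smask != 0)
    [(10, 0), (12, 0)],    # page 2: image 10 is reused from page 1
]


def unique_xrefs(pages):
    """Collect the XREFs of images without a mask, each only once."""
    xrefs = set()
    for page in pages:            # first loop: pages of the document
        for entry in page:        # second loop: image entries on the page
            if entry[1] == 0:     # condition: keep entries with no mask
                xrefs.add(entry[0])
    return xrefs


# The equivalent one-line set comprehension, as used in the program.
as_comprehension = {e[0] for page in fake_doc for e in page if e[1] == 0}

print(sorted(unique_xrefs(fake_doc)))  # prints [10, 12]
```

Image 10 appears on both pages but ends up in the set only once, which is the reason for using a set instead of a list.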
Notes:
- The set comprehension contains two loops and one condition. The first loop, for page in doc, iterates over the pages of the PDF document; the second, for xref in page.get_images(False), iterates over the image entries on the page, each a tuple whose first item is the image's XREF. The condition if xref[1] == 0 is meant to cut off “pseudo-images” (“stencil masks”) with the special purpose of defining the transparency of some other image. The second item of each entry is the XREF of the image's mask, or 0 if it has none, so the condition keeps only images that have no mask. Yes, the alpha channel (transparency) is lost this way. Scanned papers very likely contain no “stencil masks”, so the check may be overkill.
- Using a set is probably overkill, because PDF files produced by a document scanner contain only unique images, but in theory one image can be referenced from several pages.
- Determining the NumPy data type from "bpc" is a very risky part of the program. If the number of bits per component is less than 8, the expression produces 'u0', a zero-byte data type, so I think the program will fail with an error. I think the data type numpy.uint8 is enough in most cases.
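
One way to reduce the risk described in the last note is to always read the data as numpy.uint8. The sketch below assumes that image_dict["image"] holds the bytes of an encoded image file (the steps above describe it as usable as image file content), so the buffer can be treated as a plain byte stream whatever "bpc" says. The helper name to_byte_buffer is hypothetical, and the PNG signature is only a stand-in for real image data.

```python
import numpy as np


def to_byte_buffer(image_bytes):
    """Wrap image-file bytes in a 1-D uint8 array without copying them."""
    return np.frombuffer(image_bytes, dtype=np.uint8)


# The 8-byte PNG signature stands in for image_dict["image"] here.
buf = to_byte_buffer(b"\x89PNG\r\n\x1a\n")
print(buf.dtype, buf.shape)
```

An array built this way could be passed to cv2.imdecode in place of the one whose type is derived from "bpc". If the original bit depth matters, decoding with cv2.IMREAD_UNCHANGED instead of cv2.IMREAD_GRAYSCALE may be worth trying.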