-1

The goal: scan paper documents on a document scanner into a single PDF file and then process the pages with OpenCV.

The expected result is a program like this:

  1. Extract one image from the PDF file as binary data.
  2. Convert it into an OpenCV Mat.
  3. Process the Mat: image processing, line detection, and so on.
  4. Repeat for the next image.
Sergey Zaykov
  • you might have noticed the downvotes. perhaps word this question a little differently. I think it's okay to "ask a question" and then answer it yourself for the purpose of constructing a canonical question+answer, where similar questions in the past have been worded poorly or received no useful answers. I don't know what is required of such a situation though. – Christoph Rackwitz Jan 05 '22 at 10:11

1 Answer

2

The program uses the PyMuPDF package to extract images from the PDF file:

  1. Open the PDF file with a context manager: with fitz.Document(file) as doc.
  2. Get the set of image XREFs via a set comprehension. More details are in the notes.
  3. Loop over the set of XREFs.
  4. In every iteration:
  5. Extract the dictionary containing one image with image_dict = doc.extract_image(xref). An alternative is to create a Pixmap.
  6. The key "image" now holds the image data, usable as image file content, and the key "bpc" holds the number of bits per component.
  7. The number of bits per component is used to work out the element type of the NumPy array as np.dtype(f'u{image_dict["bpc"] // 8}'), where f'u{image_dict["bpc"] // 8}' is an array-protocol type string such as 'u1' or 'u2', that is, a one-byte or two-byte unsigned integer. See the notes.
  8. Create a NumPy array with the NumPy function frombuffer, using the type described above.
  9. The OpenCV function imdecode converts the array into an OpenCV Mat.
  10. Show the Mat on the screen with OpenCV's imshow().
  11. Go to the next image.
import fitz
import numpy as np
import cv2

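# Path of the PDF file to extract images from
file = "test.pdf"

with fitz.Document(file) as doc:
    # Unique XREFs of the images on all pages (see notes 1 and 2)
    xrefs = {img[0] for page in doc for img in page.get_images(False) if img[1] == 0}
    for xref in xrefs:
        # Dictionary with the image file content and its metadata
        image_dict = doc.extract_image(xref)
        # Element type derived from the bits per component (see note 3)
        dtype = np.dtype(f'u{image_dict["bpc"] // 8}')
        # Decode the image file content into an OpenCV Mat (grayscale)
        mat = cv2.imdecode(np.frombuffer(image_dict["image"], dtype),
                           cv2.IMREAD_GRAYSCALE)
        cv2.imshow("OpenCV", mat)
        # Wait for a key press before moving on to the next image
        cv2.waitKey(0)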
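
Step 5 mentions creating a Pixmap as an alternative to extract_image. A minimal sketch of that route follows, assuming the Pixmap exposes 8-bit samples through its samples, height, width and n attributes; the helper name pixmap_to_array is hypothetical and not part of PyMuPDF or of the answer's code.

```python
import numpy as np


def pixmap_to_array(pix):
    """Return the pixmap samples as a (height, width, channels) uint8 array.

    Assumes pix.samples holds height * width * n bytes of 8-bit samples.
    """
    samples = np.frombuffer(pix.samples, dtype=np.uint8)
    return samples.reshape(pix.height, pix.width, pix.n)


# Hypothetical usage inside the loop over XREFs:
#     pix = fitz.Pixmap(doc, xref)
#     mat = pixmap_to_array(pix)
```

Unlike extract_image, this yields decoded pixel data, so no imdecode call is needed. For colour images the channel order would probably have to be converted before handing the array to OpenCV, which expects BGR.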

Notes:

  1. The set comprehension contains two loops and one condition. The first loop, for page in doc, iterates over the pages of the PDF document; the second, for xref in page.get_images(False), iterates over the image entries of that page. The condition if xref[1] == 0 is meant to cut off “pseudo-images” (“stencil masks”), whose special purpose is to define the transparency of some other image. The PyMuPDF documentation describes the second field of each entry as the xref of the image's soft mask (0 if there is none), so in effect the condition skips images that have a mask. Yes, this means transparency (the alpha channel) is not preserved. Scanned pages very likely contain no “stencil masks”, so the condition may be overkill.
  2. Using a set is probably overkill as well, because PDF files produced by a document scanner contain only unique images, but in theory one image can be referenced from several pages.
  3. Determining the NumPy data type from "bpc" is a very risky part of the program. If the number of bits per component is less than 8, the expression yields 'u0', a zero-byte data type, so I think the program will fail with an error. Since "image" holds image file content rather than raw samples, imdecode only needs a buffer of bytes, so I think using the data type numpy.uint8 is enough in most cases.
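
To illustrate note 3, here is a small NumPy-only sketch of why a one-byte element type is the safer choice for np.frombuffer. The bytes below are made-up stand-in data (a PNG signature plus one extra byte), not output from a real PDF.

```python
import numpy as np

# Stand-in for image_dict["image"]: the 8-byte PNG signature plus one
# extra byte, giving an odd total length of 9 bytes (hypothetical data).
encoded = b"\x89PNG\r\n\x1a\n" + b"\x00"

# One-byte unsigned integers accept a buffer of any length.
as_bytes = np.frombuffer(encoded, dtype=np.uint8)

# A two-byte element type needs an even number of bytes and fails here.
try:
    np.frombuffer(encoded, dtype=np.dtype("u2"))
    two_byte_ok = True
except ValueError:
    two_byte_ok = False
```

Encoded image files can have any length, so a dtype derived from "bpc" may fail even when "bpc" is 16, while numpy.uint8 always produces a valid byte buffer.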
Sergey Zaykov