7

I am using the following code to draw a rectangle around text in an image that matches a date pattern, and it is working fine.

import re
import cv2
import pytesseract
from PIL import Image
from pytesseract import Output

img = cv2.imread('invoice-sample.jpg')
d = pytesseract.image_to_data(img, output_type=Output.DICT)
keys = list(d.keys())

# dd/mm/yyyy; raw string so that \d is not treated as a string escape
date_pattern = r'^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(19|20)\d\d$'

n_boxes = len(d['text'])
for i in range(n_boxes):
    if int(d['conf'][i]) > 60:
        if re.match(date_pattern, d['text'][i]):
            (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
            img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('img', img)
cv2.waitKey(0)
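# img is a NumPy array in BGR order and has no save(); convert it to a PIL image first
Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB)).save("sample.pdf")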

Now, at the end, I get a PDF with a rectangle around each matched date.
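
The filtering step of the loop above can also be pulled into a small pure function, which makes it easy to reuse once the image comes from a PDF page instead of a file. This is only a minimal sketch, assuming the dictionary layout that `pytesseract.image_to_data(..., output_type=Output.DICT)` returns (`text`, `conf`, `left`, `top`, `width`, `height`). The names `find_date_boxes` and `min_conf` are hypothetical, and the sample dictionary is hand-written for illustration rather than real OCR output:

```python
import re

# Same dd/mm/yyyy pattern as in the question, compiled once; fullmatch() below
# makes the ^ and $ anchors unnecessary.
DATE_PATTERN = re.compile(r'(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(19|20)\d\d')


def find_date_boxes(data, min_conf=60):
    """Return (x, y, w, h) for every OCR word that looks like a date."""
    boxes = []
    for i, text in enumerate(data['text']):
        # float() accepts int, float and numeric-string confidences alike
        if float(data['conf'][i]) <= min_conf:
            continue
        if DATE_PATTERN.fullmatch(text.strip()):
            boxes.append((data['left'][i], data['top'][i],
                          data['width'][i], data['height'][i]))
    return boxes


# Hand-written stand-in for real OCR output (illustrative values only)
sample = {
    'text': ['Invoice', '12/05/2020', '32/01/2020', '01/01/1999'],
    'conf': ['96', '91.5', '88', '40'],
    'left': [10, 120, 10, 200],
    'top': [5, 5, 40, 40],
    'width': [60, 80, 80, 80],
    'height': [12, 12, 12, 12],
}
print(find_date_boxes(sample))
```

Each returned tuple can be passed to `cv2.rectangle` in the same way as in the loop above.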

I want to give this program a scanned PDF as input instead of the image above. It should first convert the PDF into an image format readable by OpenCV and then apply the same processing. Please help. (Any workaround is fine. I need a solution that converts the PDF to images and uses them directly, without saving them to disk and reading them back, because I have a lot of PDFs to process.)

P.Natu

3 Answers

9

There is a library named pdf2image, which you can install with pip install pdf2image. You can then use the following to convert the pages of the PDF to images in the required format:

from pdf2image import convert_from_path

pages = convert_from_path("pdf_file_to_convert")
for i, page in enumerate(pages):
    # Pillow's format name is "JPEG" ("jpg" is rejected); write one file per page
    page.save(f"page_image_{i}.jpg", "JPEG")

Now you can use this image to apply opencv functions.

You can use BytesIO to do your work without saving the file:

from io import BytesIO
from PIL import Image

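with BytesIO() as f:
    page.save(f, format="jpeg")  # "jpeg", not "jpg"
    f.seek(0)
    img_page = Image.open(f)
    img_page.load()  # Image.open is lazy: read the pixels before the buffer closes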
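
For reference, here is a self-contained sketch of the in-memory round trip that can be run without a PDF. It assumes only Pillow. `roundtrip_in_memory` is a hypothetical helper name, and the solid red image stands in for one page returned by pdf2image. Since those pages are already PIL images, the round trip is mostly useful when a specific encoded format is needed, not as a required step:

```python
from io import BytesIO

from PIL import Image


def roundtrip_in_memory(page, fmt="PNG"):
    """Encode a PIL image into an in-memory buffer and decode it again."""
    with BytesIO() as buf:
        page.save(buf, format=fmt)  # "PNG" and "JPEG" are valid names; "jpg" is not
        buf.seek(0)
        decoded = Image.open(buf)
        decoded.load()  # force the pixel data to be read before buf is closed
    return decoded


# Stand-in for one page from pdf2image (illustrative only)
page = Image.new("RGB", (40, 30), (255, 0, 0))
copy = roundtrip_in_memory(page)
print(copy.size, copy.getpixel((0, 0)))
```

PNG is used in the sketch because it is lossless, so the decoded pixels match the original exactly; with JPEG they would only be approximately equal.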
Tomerikoo
dewDevil
  • I have a large number of PDFs, including multipage ones, so it would be better if I could pass the image object directly to the OpenCV functions. Saving the images and reading them back would cost both time and space in my case. Please advise. – P.Natu May 16 '20 at 09:21
  • See the edited version of my post: it shows how to work without saving the image, so the data stays in memory and is not written to disk. – dewDevil May 16 '20 at 10:17
  • Panics for me: with io.BytesIO as f: AttributeError: `__enter__` It seems BytesIO needs direct input in the constructor. – 00zetti Apr 29 '21 at 14:39
  • It should be `with BytesIO() as f` not `BytesIO`, & also specify `format='jpeg'` if you face issues at that line. – Mohith7548 Jul 08 '21 at 07:16
2

From PDF to opencv ready array in two lines of code. I have also added the code to resize and view the opencv image. No saving to disk.

# imports
from pdf2image import convert_from_path
import cv2
import numpy as np

# convert PDF to image then to array ready for opencv
pages = convert_from_path('sample.pdf')
img = cv2.cvtColor(np.array(pages[0]), cv2.COLOR_RGB2BGR)  # PIL gives RGB, OpenCV expects BGR

# opencv code to view image
img = cv2.resize(img, None, fx=0.5, fy=0.5)
cv2.imshow("img", img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Remember: if poppler is not in your Windows PATH variable, you can pass its location to convert_from_path:

poppler_path = r'C:\path_to_poppler'
pages = convert_from_path('sample.pdf', poppler_path=poppler_path)
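
For multipage PDFs, the same idea can be wrapped in a generator that hands out one OpenCV-ready array per page. This is a sketch under assumptions, not a definitive implementation: `pil_to_bgr` and `iter_pdf_pages_bgr` are hypothetical helper names, and only `convert_from_path` with its `dpi` and `poppler_path` parameters comes from pdf2image. The channel order is reversed because pdf2image returns PIL images in RGB while OpenCV expects BGR:

```python
import numpy as np
from PIL import Image


def pil_to_bgr(page):
    """Turn a PIL image into a contiguous BGR uint8 array for OpenCV."""
    rgb = np.array(page.convert("RGB"))
    return np.ascontiguousarray(rgb[:, :, ::-1])  # reverse channels: RGB -> BGR


def iter_pdf_pages_bgr(pdf_path, dpi=200, poppler_path=None):
    """Yield (page_number, bgr_array) for each page; nothing is written to disk."""
    # Imported here so pil_to_bgr stays usable without pdf2image/poppler installed
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=dpi, poppler_path=poppler_path)
    for number, page in enumerate(pages, start=1):
        yield number, pil_to_bgr(page)


# Stand-in for a pdf2image page: a pure red 4x3 image (illustrative only)
demo = pil_to_bgr(Image.new("RGB", (4, 3), (255, 0, 0)))
print(demo.shape, demo[0, 0].tolist())
```

A loop such as `for number, img in iter_pdf_pages_bgr('sample.pdf'): ...` would then receive each page as an array that can go straight into pytesseract and `cv2.rectangle`. Note that `convert_from_path` still renders all pages up front; for very large files its `first_page` and `last_page` parameters could be used to render in smaller batches.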

Cam
  • Thanks for the answer. But note that the pages you get from pdf2image are PIL images in RGB format, whereas OpenCV stores images in BGR format. You need to convert the image with `img = img[..., (2, 1, 0)]`. Otherwise the colors will be wrong, with red turning into blue and vice versa. – Усердный бобёр May 24 '23 at 13:04
1

You can use the library pdf2image. Install with this command: pip install pdf2image. You can then convert the file into one or multiple images readable by cv2. The next sample of code will convert the PIL Image into something readable by cv2:

Note: The following code requires numpy (pip install numpy).

from pdf2image import convert_from_path
import numpy as np

images_of_pdf = convert_from_path('source2.pdf')  # Convert PDF to List of PIL Images
readable_images_of_pdf = []  # Create a list for the for loop to put the images into
for PIL_Image in images_of_pdf:
    readable_images_of_pdf.append(np.array(PIL_Image))  # Add items to list

The next bit of code can convert the pdf into one big image readable by cv2:

import cv2
import numpy as np
from pdf2image import convert_from_path

image_of_pdf = np.concatenate(tuple(convert_from_path('/path/to/pdf/source.pdf')), axis=0)
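
One caveat with the one-liner above: np.concatenate only works when the pages agree in the dimension that is not being concatenated (the width for axis=0, the height for axis=1). The following is a rough sketch of one way around that, padding smaller pages with white before stacking. `stack_pages` and `fill` are hypothetical names, and the small zero arrays merely stand in for real page images:

```python
import numpy as np


def stack_pages(pages, axis=0, fill=255):
    """Concatenate page arrays along axis 0 or 1, padding with `fill` where sizes differ."""
    arrays = [np.asarray(p) for p in pages]
    other = 1 - axis  # the dimension that has to match across pages
    target = max(a.shape[other] for a in arrays)
    padded = []
    for a in arrays:
        pad = [(0, 0)] * a.ndim
        pad[other] = (0, target - a.shape[other])  # pad at the right/bottom edge only
        padded.append(np.pad(a, pad, mode="constant", constant_values=fill))
    return np.concatenate(padded, axis=axis)


# Two fake "pages" shaped (height, width, channels) with different sizes
a = np.zeros((4, 3, 3), dtype=np.uint8)
b = np.zeros((2, 5, 3), dtype=np.uint8)
vertical = stack_pages([a, b], axis=0)    # pages on top of each other
horizontal = stack_pages([a, b], axis=1)  # pages side by side
print(vertical.shape, horizontal.shape)
```

Padding at the right and bottom edges keeps every page's origin in place, so pixel coordinates within a page only need a single offset after stacking.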

The pdf2image library's convert_from_path() function returns a list containing each PDF page as a PIL image. The list is converted to a tuple before being passed to numpy's concatenate function (any sequence would work), which stacks the pages on top of each other; this requires all pages to have the same width. If you want them side by side, change axis to 1, which concatenates along the horizontal (x) axis instead and requires all pages to have the same height. This next bit of code will show the image on the screen:

cv2.imshow("Image of PDF", image_of_pdf)
cv2.waitKey(0)

This will probably create a window that is too big for the screen. To shrink the image so that it fits, you can use cv2's built-in resize function:

import cv2
from pdf2image import convert_from_path
import numpy as np
image_of_pdf = np.concatenate(tuple(convert_from_path('source2.pdf')), axis=0)
size = 0.15 # 0.15 is equal to 15% of the original size.
resized = cv2.resize(image_of_pdf, (int(image_of_pdf.shape[1] * size), int(image_of_pdf.shape[0] * size)))  # dsize is (width, height)
cv2.imshow("Image of PDF", resized)
cv2.waitKey(0)

On a 1920x1080 monitor, a size of 0.15 can comfortably display a 3-page document. The downside is that the quality is reduced dramatically. If you want to keep the pages separate, you can use convert_from_path() on its own. The following code shows each page individually; press any key to go to the next page:

import cv2
from pdf2image import convert_from_path
import numpy

images_of_pdf = convert_from_path('source2.pdf')  # Convert PDF to List of PIL Images
count = 0  # Start counting which page we're on
while True:
    cv2.imshow(f"Image of PDF Page {count + 1}", numpy.array(images_of_pdf[count]))  # Display the page with its number
    cv2.waitKey(0)  # Wait until a key is pressed
    cv2.destroyWindow(f"Image of PDF Page {count + 1}")  # Close the window of the current page
    count += 1  # Add to the counter by 1
    if count == len(images_of_pdf):
        break  # Break out of the while loop before you get an "IndexError: list index out of range"
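
Instead of hard-coding a size such as 0.15, the scale can be derived from the image and screen dimensions. This is only a sketch: `fit_scale` and `scaled_size` are hypothetical helpers, the 1920x1080 defaults are an assumption taken from the monitor mentioned above, and the 1700x6600 input is an illustrative size for three stacked pages.

```python
def fit_scale(image_h, image_w, max_h=1080, max_w=1920):
    """Largest scale factor (never above 1.0) at which the image fits the box."""
    return min(max_h / image_h, max_w / image_w, 1.0)


def scaled_size(image_h, image_w, scale):
    """Return (width, height), the order cv2.resize expects for its dsize argument."""
    return max(1, round(image_w * scale)), max(1, round(image_h * scale))


# Illustrative numbers: three stacked pages of 1700x2200 pixels each
scale = fit_scale(6600, 1700)
print(scale, scaled_size(6600, 1700, scale))
```

The result could then be used as `cv2.resize(image_of_pdf, scaled_size(h, w, scale), interpolation=cv2.INTER_AREA)`, where `h, w = image_of_pdf.shape[:2]`; INTER_AREA is generally a reasonable choice when shrinking.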
Suave101