I have a diagram in PDF format. I am using pdfminer.six to extract the text in the diagram along with the bounding boxes of that text. Everything is fine so far.

System info: Windows 10, Python 3.9.13

Now I want to draw these bounding boxes on an image of the PDF to create a visualization, using OpenCV's rectangle(). I am unsure how to do this, because converting the PDF to an image with pdf2image requires a DPI value.

Can anyone tell me how to draw this visualization using the bounding box data given by pdfminer?

Below is example code for the bounding box extraction using pdfminer, followed by a sample of the output to show how the bounding boxes are returned.

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox, LAParams
import cv2 as cv
import os
import numpy as np

path = r"sample.pdf"
assert os.path.exists(path), "PDF path is wrong"

laparams = LAParams(detect_vertical=True)

for page_layout in extract_pages(path, laparams=laparams):
    for element in page_layout:
        if isinstance(element, LTTextBox):
            print(element.bbox)

A snapshot of the output I am getting is as follows:

....
(64.46833119999998, 758.4685649600001, 143.16671221999994, 763.35576496)
(399.3279, 797.7414805999999, 464.28060816000004, 812.3556692000001)
(520.1078, 797.7414805999999, 631.1472937599999, 812.3556692000001)
(676.9479, 797.7414805999999, 762.4986252799999, 812.3556692000001)
(709.8279, 787.0014805999999, 729.9863304, 792.5868806)
....
tintin98
  • The PDF itself should tell you its physical dimensions in the document metadata. There is no "dots per inch" value for PDFs themselves: they consist of (mostly) vector graphics, which aren't defined by dots/pixels. There is no DPI value until you render the PDF's vector graphics as dots/pixels (e.g. when printing, or rendering on a screen), and then the DPI value comes from the device you're using (e.g. printing to paper at 600dpi, or rendering to a screen with 336ppi), not from the PDF. – Mike 'Pomax' Kamermans Jul 31 '23 at 18:48
  • I am editing the question to get an answer on how to get the visualization in this case then. – tintin98 Jul 31 '23 at 18:51

1 Answer

This is how I solved the problem: the bounding boxes of all text lines are collected into a pandas DataFrame (a plain list works as well). I use this function to draw a rectangle on the page:

import random

import cv2 as cv
import numpy as np


def draw(img, BBOX, Size, randomcolor: bool = False):
    # pdfminer's origin is the bottom-left corner, OpenCV's is the top-left,
    # so y is flipped using the page height (Size[3]).
    rsx = int(np.floor(BBOX[0]))
    rsy = int(np.floor(Size[3]) - np.floor(BBOX[1]))
    rex = int(np.floor(BBOX[2]))
    rey = int(np.floor(Size[3]) - np.floor(BBOX[3]))

    if randomcolor:
        R = random.randint(20, 255)
        G = random.randint(20, 255)
        B = random.randint(20, 255)
    else:
        R = G = B = 255
    # OpenCV expects colors in BGR order.
    cv.rectangle(img, (rsx, rsy), (rex, rey), (B, G, R), 1)

The following function then handles the input:

def draw_textlines(list_of_BBOXes, page_size, filename, randomcolor=False):
    # page_size is the media box; three channels so random colors are visible.
    img = np.zeros((int(page_size[3]), int(page_size[2]), 3), np.uint8)
    for BBOX in list_of_BBOXes:
        draw(img, BBOX, page_size, randomcolor)
    cv.imwrite(filename, img)

The media box of the page can be used as the image size (when converted to int). You can scale the output page as you wish.
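If the boxes should be drawn on a rendering of the PDF rather than on a blank canvas, the coordinates have to be scaled to the rendering resolution. The sketch below shows one way to do that, assuming the default PDF unit of 1/72 inch, a media box whose origin is (0, 0), and an unrotated page. The helper name `pdf_bbox_to_pixels` and the sample page size are hypothetical and only for illustration.

```python
import math

PDF_POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def pdf_bbox_to_pixels(bbox, page_height_pt, dpi):
    """Map a pdfminer bbox (x0, y0, x1, y1), in points with the origin at the
    bottom-left, to integer pixel corners with the origin at the top-left."""
    scale = dpi / PDF_POINTS_PER_INCH
    x0, y0, x1, y1 = bbox
    left = math.floor(x0 * scale)
    right = math.ceil(x1 * scale)
    # Flip the y axis: the top edge of the box is y1 in PDF space.
    top = math.floor((page_height_pt - y1) * scale)
    bottom = math.ceil((page_height_pt - y0) * scale)
    return (left, top), (right, bottom)


# Hypothetical page of 595 x 842 pt rendered at 144 DPI (scale factor 2).
top_left, bottom_right = pdf_bbox_to_pixels((100.0, 700.0, 200.0, 720.0), 842.0, 144)
print(top_left, bottom_right)  # (200, 244) (400, 284)
```

The two corners can be passed directly to cv.rectangle(), provided the image was rendered at the same DPI that was passed to the helper, for example through the dpi argument of pdf2image's convert_from_path().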

Ali
  • Can you explain the parameters of `draw_textlines()`? – tintin98 Aug 01 '23 at 11:45
  • `list_of_BBOXes`: with pdfminer, each text line comes with a bounding box, so you can build a list of the bounding boxes for all text lines on a page, i.e. [(x0, y0, x1, y1), ...]. `page_size` is the page's media box, which determines the size of your image; the function needs it to position the first text line of the PDF page at the top of the image. `filename` is the name of the file the image is written to. `randomcolor` lets you choose whether the rectangles are drawn in random colors or in white. – Ali Aug 03 '23 at 09:14