I am using pdf2image to convert PDF pages to images and table-transformers to detect tables in those images. I need help with the coordinates.

The issue is that the detected table borders are accurate, but they are in image pixels, which do not match PDF coordinates. Is there a way to convert image coordinates to PDF coordinates? Here is my code for reference:

from pdf2image import convert_from_path

images = convert_from_path('/content/Sample Statement Format Bancslink.pdf')

for i in range(len(images)):
  images[i].save('/content/pages_sbi/page'+str(i)+'.jpeg')
  • That function renders PDF pages to JPEG images. For each page image you now want to know where (in pixels) to find e.g. a table of which you know the PDF coordinates? – Jorj McKie May 22 '23 at 08:06
  • My first take would be to look at the DPI side of things. `pdf2image` seems to use 200 DPI by default. I'd try [convert](https://pdf2image.readthedocs.io/en/latest/reference.html#pdf2image.pdf2image.convert_from_path) using the same DPI as your original PDF file. – Er... May 22 '23 at 08:57
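
A minimal sketch of the DPI idea from the comment above, assuming the images were rendered at a known DPI (200 is the `pdf2image` default) and that the PDF uses the default unit of 1/72 inch per point. The helper names `pixels_to_points` and `box_pixels_to_points` are made up for illustration. Rendered sizes may be rounded to whole pixels, so the result can be off by a fraction of a point.

```python
PDF_POINTS_PER_INCH = 72  # default PDF user-space unit: 1 point = 1/72 inch


def pixels_to_points(value_px, dpi=200):
    """Convert one pixel measurement to PDF points for a given render DPI."""
    return value_px * PDF_POINTS_PER_INCH / dpi


def box_pixels_to_points(box_px, dpi=200):
    """Convert an (x0, y0, x1, y1) pixel box to points, keeping a top-left origin."""
    return tuple(pixels_to_points(v, dpi) for v in box_px)


if __name__ == "__main__":
    # Hypothetical table box detected in a 200 DPI page image
    print(box_pixels_to_points((0, 100, 400, 1000)))
```

The y values here are still measured from the top of the page, as in the image.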

2 Answers

Here is how to use PyMuPDF to transform image coordinates back to PDF page coordinates.

This of course works page by page. So in the following, an image file is assumed to be made from the corresponding page.

import fitz  # PyMuPDF import

doc = fitz.open("input.pdf")
page = doc[pno]  # page number pno is 0-based
image = f"image{pno}.jpg"  # filename of the matching image of the page

# rectangle, e.g. one that wraps a table in the image
# x0, y0 are coordinates of its top-left point
# x1, y1 is the bottom-right point
rect = fitz.Rect(x0, y0, x1, y1)

# make a PyMuPDF image (Pixmap) from the JPEG
pix = fitz.Pixmap(image)

# make a matrix that converts any image coordinates to page coordinates
mat = pix.irect.torect(page.rect)

# now every image coordinate can be converted to page coordinates
# e.g. this is the table rect in page coordinates:
pdfrect = rect * mat

# if you don't want PyMuPDF objects as rectangle, just use
# tuple(pdfrect) to retrieve the 4 coordinates

As an aside, PyMuPDF can also render pages to images. So if your table detection mechanism can be invoked page by page, you could make a loop like this:

  1. Read page using PyMuPDF
  2. Convert page to an image. Could be in memory, too.
  3. Pass page image to table recognizer, which returns table coordinates
  4. Use table coordinates and convert them to page coordinates as shown above.
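
The four steps above could be sketched roughly as follows. This is only an outline under assumptions: `detect_tables` stands for whatever table recognizer you use (hypothetical here; it is assumed to take a `fitz.Pixmap` and return `(x0, y0, x1, y1)` boxes in image pixels), and the function name and the 150 DPI default are arbitrary choices.

```python
def detect_tables_in_pdf(pdf_path, detect_tables, dpi=150):
    """Yield (page_number, (x0, y0, x1, y1)) with boxes in PDF page coordinates.

    detect_tables is a placeholder for your own recognizer (an assumption):
    it receives a fitz.Pixmap and returns boxes in image pixels.
    """
    import fitz  # PyMuPDF, imported here so the sketch loads without it

    with fitz.open(pdf_path) as doc:
        for page in doc:                                # 1. read the page
            pix = page.get_pixmap(dpi=dpi)              # 2. render it in memory
            mat = pix.irect.torect(page.rect)           # image -> page matrix
            for box in detect_tables(pix):              # 3. boxes in pixels
                page_rect = fitz.Rect(box) * mat        # 4. back to page coordinates
                yield page.number, tuple(page_rect)
```

If the recognizer expects a PIL image or raw bytes instead of a Pixmap, `pix.tobytes("png")` gives an in-memory PNG that can be handed over without writing a file.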
Jorj McKie

I found a solution that should work in most cases.
Take this as your code for converting the PDF to images:

from pdf2image import convert_from_path

images = convert_from_path('PATH')

!mkdir pages

for i in range(len(images)):
  images[i].save('/content/pages/page'+str(i)+'.jpeg')

Next, get the page dimensions from the PDF:

from pypdf import PdfReader

reader = PdfReader('PATH')
box = reader.pages[0].mediabox

pdf_width = box.width
pdf_height = box.height

Then read the image and get its dimensions:

import cv2
im = cv2.imread('/content/pages/page0.jpeg')
height, width, channels = im.shape 

Now take x_1, y_1, x_2 and y_2 as the coordinates of a box in the image. To get the same location in the PDF, scale them by the ratio of PDF size to image size:

# scale x values by the width ratio and y values by the height ratio
x_1 = x_1 / width * pdf_width
y_1 = y_1 / height * pdf_height
x_2 = x_2 / width * pdf_width
y_2 = y_2 / height * pdf_height

Use these coordinates for your work.
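
For reuse, the scaling can be wrapped in a small function. This is a sketch, not a definitive implementation: the name `image_box_to_pdf_box` and the `flip_y` switch are my own additions. The switch is there because native PDF coordinates (as reported by pypdf) have their origin at the bottom-left of the page, while image coordinates start at the top-left, so some tools need the y axis inverted. It also assumes the mediabox starts at (0, 0).

```python
def image_box_to_pdf_box(box, image_size, page_size, flip_y=False):
    """Scale an (x_1, y_1, x_2, y_2) pixel box to PDF units.

    image_size is (width, height) in pixels, page_size is
    (pdf_width, pdf_height) in PDF units. With flip_y=True the y values
    are measured from the bottom of the page instead of the top.
    """
    x_1, y_1, x_2, y_2 = box
    width, height = image_size
    pdf_width, pdf_height = page_size

    scale_x = pdf_width / width
    scale_y = pdf_height / height

    left, right = x_1 * scale_x, x_2 * scale_x
    top, bottom = y_1 * scale_y, y_2 * scale_y

    if flip_y:
        # Bottom-left origin: the image's bottom edge becomes the lower y value
        return (left, pdf_height - bottom, right, pdf_height - top)
    return (left, top, right, bottom)


if __name__ == "__main__":
    # Hypothetical values: a US Letter page (612 x 792) rendered at 1700 x 2200 pixels
    print(image_box_to_pdf_box((100, 200, 300, 400), (1700, 2200), (612, 792)))
    print(image_box_to_pdf_box((100, 200, 300, 400), (1700, 2200), (612, 792), flip_y=True))
```

Whether you need `flip_y` depends on the library that consumes the result, so check its coordinate convention first.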