Get text based on coordinates in the same format as in the PDF

Asked Mar 13 '23 at 13:36

Active Mar 24 '23 at 05:11

Viewed 353 times

I have the coordinates of a region, but I can't find a method in PyMuPDF that fetches a block of text based on coordinates. Is there one? I'm open to other libraries, though I have already tried PDFQuery, which did not work properly.

Explanation: I want to read the block of text within given coordinates using PyMuPDF. For example, given the coordinates x0, y0, x1, y1, I should be able to get the text within that block in the same format as in the PDF.

For example, if I run:

print(page.get_textbox(fitz.Rect([40.91999816894531, 274.94500732421875, 349.88214111328125, 364.9531555175781])))

It gives me a string with each word in that block on a separate line. Is there a way to get the block in the same format as in the PDF?

edited Mar 24 '23 at 05:11

Ben the Coder

asked Mar 13 '23 at 13:36

m9m9m


Are you referring to pdfminer or to pymupdf? **PyMuPDF definitely is able** to deliver all coordinates of all text - down to each single character if needed. – Jorj McKie Mar 15 '23 at 11:56

With PyMuPDF, you can extract text of the whole page, or from any sub-rectangle you want. All this is also documented in detail - so please be more specific, what your problem is. – Jorj McKie Mar 15 '23 at 11:58
I have added more details in the question @JorjMcKie – m9m9m Mar 15 '23 at 13:32
Is pdftotext a python library? @KJ – m9m9m Mar 15 '23 at 18:29
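
For anyone landing here: one possible way to approach the sub-rectangle extraction mentioned in the comments is to ask PyMuPDF for the words inside the rectangle and rebuild the lines yourself from their positions. The sketch below is not an accepted answer, just an illustration under some assumptions. It assumes word tuples shaped like (x0, y0, x1, y1, text, block_no, line_no, word_no), which is what page.get_text("words") returns. The function name words_to_lines and the tolerance parameter are made up for this example, and the grouping rule (words whose bottom edges are within a few points of each other belong to the same line) is a heuristic that may need tuning for a given PDF.

```python
def words_to_lines(words, tolerance=3.0):
    """Rebuild text lines from PyMuPDF-style word tuples.

    Each item is assumed to start with (x0, y0, x1, y1, text, ...).
    Words whose bottom edge (y1) lies within `tolerance` points of the
    first word of the current line are treated as part of that line.
    `tolerance` is a hypothetical knob, not a PyMuPDF parameter.
    """
    # Process words top to bottom, then left to right.
    ordered = sorted(words, key=lambda w: (w[3], w[0]))

    lines = []  # each entry: (reference_y1, [(x0, text), ...])
    for w in ordered:
        x0, _y0, _x1, y1, text = w[:5]
        if lines and abs(y1 - lines[-1][0]) <= tolerance:
            lines[-1][1].append((x0, text))
        else:
            lines.append((y1, [(x0, text)]))

    # Within each line, order the words by their left edge.
    return "\n".join(
        " ".join(text for _, text in sorted(line_words))
        for _, line_words in lines
    )
```

With PyMuPDF installed, this would be used roughly as words = page.get_text("words", clip=fitz.Rect(x0, y0, x1, y1)) followed by print(words_to_lines(words)). The clip argument restricts extraction to the rectangle, so no manual filtering of coordinates should be needed.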
