2

I need to extract the text that sits inside a rectangle in a PDF. I have tested several packages (PyMuPDF, pdfplumber, tabula, camelot, and pdftables), but none of them returns that specific text. PyMuPDF asks for the beginning and ending words of the text to extract. As far as I understand, the remaining packages only extract information about lines and curves, not the text.

I want to get the text from rectangles in a PDF without providing any starting and ending text.

https://drive.google.com/file/d/1wCvik7VbEvDwbT-mapgXc8fwlq7Ao3BP/view?usp=sharing

halfer
Kamaal Shaik
  • Can you provide a copy of the PDF from which you are trying to extract the text? And also the text in the PDF that you want to extract. Without it, we would be only guessing. – moys Feb 13 '20 at 08:02
  • Sure. Give me 5 minutes and I will prepare and provide one, because the PDF I am using is confidential. – Kamaal Shaik Feb 13 '20 at 08:04
  • Hi moys, I edited the question and added a PDF. Can you please check now? – Kamaal Shaik Feb 13 '20 at 08:26
  • I'd recommend using Pillow (or some other image recognition) to first get the coordinates of the rectangle, and then use those coords in pymupdf to get the text inside. I have done the second, not sure if the former is possible though. – Sumant Agnihotri Apr 22 '20 at 08:27
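
The two-step idea in the last comment (find the rectangle coordinates first, then pull out the text inside them) can be sketched without any PDF library. The sketch below assumes the word boxes are already available as (x0, y0, x1, y1, text) tuples, for instance the first five fields of what PyMuPDF's page.get_text("words") returns, and that each rectangle is an (x0, y0, x1, y1) tuple. The helper names centre_inside and text_in_rects are hypothetical, and the "centre of the word lies inside the box" rule is just one reasonable choice.

```python
from typing import Iterable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]        # (x0, y0, x1, y1)
Word = Tuple[float, float, float, float, str]  # word box plus the word itself


def centre_inside(word: Word, rect: Box) -> bool:
    """Return True if the centre of the word box lies inside rect."""
    cx = (word[0] + word[2]) / 2
    cy = (word[1] + word[3]) / 2
    return rect[0] <= cx <= rect[2] and rect[1] <= cy <= rect[3]


def text_in_rects(words: Sequence[Word], rects: Iterable[Box]) -> List[str]:
    """Join, per rectangle, the words whose centre falls inside it."""
    results = []
    for rect in rects:
        inside = [w for w in words if centre_inside(w, rect)]
        # Sort top-to-bottom, then left-to-right, to approximate reading order.
        inside.sort(key=lambda w: (round(w[1]), w[0]))
        results.append(" ".join(w[4] for w in inside))
    return results


if __name__ == "__main__":
    # Made-up coordinates, loosely modelled on the "Details" box of the sample PDF.
    words = [
        (10, 10, 40, 20, "Title"),
        (110, 105, 140, 115, "State:"),
        (145, 105, 200, 115, "State_name"),
        (110, 125, 140, 135, "City:"),
        (145, 125, 195, 135, "City_name"),
    ]
    print(text_in_rects(words, [(100, 100, 300, 200)]))
```

Testing the word centre rather than the whole word box keeps words that slightly overlap the border; a stricter containment test may suit tightly packed layouts better.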

2 Answers

0

You can use the code below to extract the text of the whole PDF with PyPDF2:

import PyPDF2


def convert_pdf_to_text(document):
    """Return the text of every page in the PDF as a single string."""
    # PdfReader, .pages and extract_text() replace PdfFileReader, getNumPages(),
    # getPage() and extractText(), which were removed in PyPDF2 3.0.
    reader = PyPDF2.PdfReader(document, strict=False)

    all_text = ""
    for page in reader.pages:
        all_text += page.extract_text()
    return all_text.replace("\n", "")


convert_pdf_to_text('pdf_test.pdf')

Output

'A Simple PDF File  This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2 ...  Details  State: State_name     City: City_name    Country: Country_name     Rig No: 4455555  Source Id: k4-3k44 '
moys
  • OK. Thanks for response moys. Let me check. – Kamaal Shaik Feb 13 '20 at 08:37
  • 2
    I think the code extracts the entire text of the PDF, but we need only the text that is inside the rectangle. – Kamaal Shaik Feb 13 '20 at 08:39
  • Hi moys, can you help to extract only text which is in rectangle box please? – Kamaal Shaik Feb 13 '20 at 08:47
  • What is the salient feature of your rectangle? Will it be in the same place on each page? Will it have the same content? There should be something that defines the location of this rectangle. What is that? – moys Feb 13 '20 at 09:00
  • Hi moys, sorry for the late reply. The real requirement is that the rectangle can be anywhere on the page, and there can be multiple rectangles. Also, there is no fixed text for the rectangles. The rectangles are dynamic. – Kamaal Shaik Feb 13 '20 at 09:22
  • @KamaalShaik I have the same issue. If you found a solution, please let me know. – Devang Hingu Feb 28 '22 at 12:42
0

You can use the method Page.get_textbox from the PyMuPDF module.

For example:

import fitz

doc = fitz.open('pdf_test.pdf')
page = doc[0]  # get first page
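# Rect takes (x0, y0, x1, y1) in points; this one spans the full page width and the top 600 points.
rect = fitz.Rect(0, 0, page.rect.width, 600)  # define your rectangle here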
text = page.get_textbox(rect)  # get text from rectangle
clean_text = ' '.join(text.split())

print(clean_text)
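
If the rectangles are drawn as vector graphics (rather than being part of a scanned image), one possible way to avoid hard-coding coordinates is to read them from the page's drawing commands with Page.get_drawings and pass each box to Page.get_textbox. This is only a sketch under that assumption: the helper names rectangle_candidates and extract_rectangle_texts and the min_size threshold are hypothetical, boxes drawn as four separate lines will not pass the "re" check, and fitz is imported lazily so the filtering helper can be tried without PyMuPDF installed.

```python
from typing import Any, Dict, Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def rectangle_candidates(drawings: Iterable[Dict[str, Any]],
                         min_size: float = 10.0) -> List[Box]:
    """Pick bounding boxes of drawings that contain a rectangle item.

    `drawings` is expected to look like the output of PyMuPDF's
    Page.get_drawings(): dicts with an "items" list and a "rect" box.
    `min_size` (an assumed threshold) drops thin rules and tiny marks.
    """
    boxes = []
    for drawing in drawings:
        if not any(item[0] == "re" for item in drawing["items"]):
            continue  # this path contains no rectangle operator
        x0, y0, x1, y1 = drawing["rect"]
        if (x1 - x0) >= min_size and (y1 - y0) >= min_size:
            boxes.append((x0, y0, x1, y1))
    return boxes


def extract_rectangle_texts(pdf_path: str) -> List[Tuple[int, Box, str]]:
    """Return (page_number, box, text) for every rectangle candidate."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency

    results = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for box in rectangle_candidates(page.get_drawings()):
                text = page.get_textbox(fitz.Rect(*box))
                # Collapse runs of whitespace, as in the example above.
                results.append((page.number, box, " ".join(text.split())))
    return results
```

Calling extract_rectangle_texts('pdf_test.pdf') and printing each returned tuple would list the text found in every detected box. Page borders, table cells, and filled backgrounds are also rectangles, so the candidates may need further filtering for a given document.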

Relevant docs: the PyMuPDF documentation for Page.get_textbox and Rect.

Diego Miguel