6

I'm trying to extract the text of a PDF within a given bounding rectangle. I understand there are tools for PDF scraping such as pdfminer, pypdf, and pdftotext. I've experimented with all three, and so far I've only managed to get pdftotext to extract text from within a given bounding box. That code looks something like this:

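import subprocess

# x, y, w, h stand in for the various inputs into my function:
# top-left corner of the crop area, then its width and height.
# Each argument must be its own list element; -W and -H set width and height.
cmd = ["pdftotext", "-x", str(x), "-y", str(y), "-W", str(w), "-H", str(h),
       pdf_path, text_out]
subprocess.call(cmd)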

However, this writes a text file. I want to use that text immediately, without having to open a text file to retrieve the words in that bounding box: I'll be doing this for 10,000+ documents, and opening that many files might be a pain. Since I'm essentially running a command-line program from my Python script, I don't think there's a way around that, but I'm unsure. pdfminer and pypdf are actual Python packages, so I can get their text directly, but they don't appear to have any means of extracting text within given pixel limits.

As a further note: I'm looking to do this in Python specifically, as I have a ton of other code for the same overarching project.

Evan Mata

3 Answers

5

The PyMuPDF/Fitz Package works for this. They provide a script & documentation at: https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction

Their script works by finding the bounding words; you can instead use a fixed rectangle by setting rect = fitz.Rect(x0, y0, x1, y1) in place of their rect = ~their stuff~. Also, pno is the page number you're extracting from, in case it's not clear.
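For illustration, here is a minimal sketch of the rectangle approach, not the linked script itself. It assumes a recent PyMuPDF in which page.get_text("words") returns tuples of the form (x0, y0, x1, y1, text, block_no, line_no, word_no). The helper names words_in_rect and text_in_rect are made up for this example, and it keeps only words whose boxes lie fully inside the rectangle.

```python
def words_in_rect(words, rect):
    """Join the words whose boxes lie fully inside rect = (x0, y0, x1, y1).

    `words` is assumed to use the layout of PyMuPDF's page.get_text("words"):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    rx0, ry0, rx1, ry1 = rect
    inside = [w for w in words
              if w[0] >= rx0 and w[1] >= ry0 and w[2] <= rx1 and w[3] <= ry1]
    # Restore reading order using the block, line and word numbers.
    inside.sort(key=lambda w: (w[5], w[6], w[7]))
    return " ".join(w[4] for w in inside)


def text_in_rect(pdf_path, pno, rect):
    """Return the text inside rect on page pno (0-based) of pdf_path."""
    import fitz  # PyMuPDF; imported here so words_in_rect works without it

    with fitz.open(pdf_path) as doc:
        return words_in_rect(doc[pno].get_text("words"), rect)
```

Something like text_in_rect("some.pdf", 0, (56, 170, 220, 230)) would then return the string directly, with no intermediate text file; the file name and coordinates here are placeholders.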

Gangula
Evan Mata
0

You can open the text file using text=open(text_out,'r').read() which will put all the text from that text file into one string. You can then parse out that string into a list of strings using text.split('your_delimiter') depending on the delimiter you choose.
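If the aim is to skip the intermediate file entirely, one option that may work is to have pdftotext write to standard output and capture it from Python. This is only a sketch: it assumes the Poppler build of pdftotext is on your PATH, where passing "-" as the output file sends the text to stdout and -W / -H set the crop width and height. The function names build_cmd and text_in_box are made up for this example.

```python
import subprocess


def build_cmd(pdf_path, x, y, w, h):
    """Build the pdftotext argument list; "-" as output means stdout."""
    return ["pdftotext",
            "-x", str(x), "-y", str(y),
            "-W", str(w), "-H", str(h),
            pdf_path, "-"]


def text_in_box(pdf_path, x, y, w, h):
    """Run pdftotext on the crop area and return its output as a string."""
    result = subprocess.run(build_cmd(pdf_path, x, y, w, h),
                            stdout=subprocess.PIPE,
                            check=True)  # raises CalledProcessError on failure
    # Assumes pdftotext's default UTF-8 output encoding.
    return result.stdout.decode("utf-8", errors="replace")
```

This still starts one process per document, so it removes the file round-trip but not the cost of launching pdftotext.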

bmsmith
  • I'm aware of this; it's my current approach. I'm seeking a way around the open() call, as I don't believe opening files is a particularly fast or efficient process and I'll be opening ~40,000 files in total by the end. – Evan Mata Apr 09 '19 at 13:23
0

Minimal example using PyMuPDF, for a PDF that has embedded text (i.e. you can select the text in the PDF):

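import fitz  # PyMuPDF

# Bounding box as x1, y1, x2, y2
bbox = [56, 170, 220, 230]

fileName = "input.pdf"  # placeholder path; replace with your own PDF
doc = fitz.open(fileName)

# Print the text inside bbox on every page
for page in doc.pages():
    print(page.get_textbox(bbox))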
grantr