I have several PDFs that I want to extract data from. The code below extracts all the text from a PDF, but now I want to extract only the text between two headings. I believe regex is the best way to do this, since the text between the two headings will vary while the headings themselves stay the same in every PDF.

This is an example PDF: https://www.scribd.com/document/396797318/123

I want to extract all the text between the headings "3. Induction Training" and "4. Corporate Training/Departmental Training".

The following code is what I am using to extract the data from the PDF:

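from io import BytesIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def pdf_to_text(path):
    manager = PDFResourceManager()
    retstr = BytesIO()
    layout = LAParams(all_texts=True)
    device = TextConverter(manager, retstr, laparams=layout)
    interpreter = PDFPageInterpreter(manager, device)

    # Feed every page through the interpreter; the converter writes the text to retstr.
    with open(path, 'rb') as filepath:
        for page in PDFPage.get_pages(filepath, check_extractable=False):
            interpreter.process_page(page)

    # Assumes the converter's default UTF-8 codec; decode so the result is a str.
    text = retstr.getvalue().decode('utf-8')

    device.close()
    retstr.close()
    return text

if __name__ == "__main__":
    text = pdf_to_text("123.pdf")
    print(text)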

What regex can I use to get the information I need?

Jlingz14

1 Answer

Try this regex (in Python, pass re.DOTALL so that . also matches newlines): (?<=3\. Induction Training\n).*(?=4\. Corporate Training\/Departmental Training)

Matt.G
  • Thank you so much! Do you know where in the pdfminer code I should put it so it loops through the pages? – Jlingz14 Jan 07 '19 at 15:07
  • I believe the text you are looking for occurs once per document, so you do not need to check the regex against each page. Apply it after the text = pdf_to_text("123.pdf") line. – Matt.G Jan 07 '19 at 15:34
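
As a rough sketch of what the last comment describes, the regex can be applied once to the full extracted text. The helper name extract_between and the sample string are hypothetical, made up for illustration. The pattern is a capture-group variant of the one in the answer rather than the lookaround version, and it assumes the text is a str (not bytes) and that the headings appear in the extracted text exactly as written in the question, which pdfminer's layout analysis may not guarantee.

```python
import re

# Heading strings taken from the question.
START = "3. Induction Training"
END = "4. Corporate Training/Departmental Training"

def extract_between(text, start=START, end=END):
    """Return the text between the two headings, or None if no match is found.

    Hypothetical helper: re.escape makes the headings literal, and
    re.DOTALL lets '.' match newlines so the section can span lines.
    """
    pattern = re.escape(start) + r"\s*(.*?)\s*" + re.escape(end)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None

# Made-up stand-in for the output of pdf_to_text("123.pdf").
sample = (
    "2. Overview\nSome text\n"
    "3. Induction Training\nLine one\nLine two\n"
    "4. Corporate Training/Departmental Training\nMore text\n"
)
section = extract_between(sample)
print(section)
```

In the original script this would go right after text = pdf_to_text("123.pdf"), calling extract_between(text). The lazy .*? stops at the first occurrence of the closing heading, which matters if that heading string shows up more than once in the document.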