Extract only the body text of a PDF, excluding bullet points, headings and subheadings, using the Python pdfplumber library

Question

Code

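import pdfplumber

ecdata = ""

with pdfplumber.open("XYZ Transcript.pdf") as pdf:
    for page_no, page_obj in enumerate(pdf.pages, start=1):
        print("Page No.:", page_no)
        # Crop away the left margin and the header/footer bands.
        page = page_obj.within_bbox((70, 50, page_obj.width, 790))
        # Guard against an empty result (None or "") for a blank region.
        ecpagedata = page.extract_text() or ""
        # Separate pages with a newline so their text does not run together.
        ecdata += ecpagedata + "\n"
        print(ecpagedata)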

Output of the above code

The required output should contain only the complete sentences from the file, without the unwanted bullets, headings and subheadings. For example:

Good day, and thank you for standing by. Welcome to the XYZ Second Quarter 2099 Earnings Conference Call. At this time, all participants are in a listen-only mode. After the speakers' presentation, there will be a question-and-answer session. (Operator Instructions) Please be advised that today's conference is being recorded.

I would now like to hand the conference over to your speaker today, Alpha, Vice President of Investor Relations. Please go ahead.

Thank you, operator. Good afternoon and welcome to XYZ’s second quarter 2022 earnings call. I'm joined today by Bravo, XYZ’s Founder and CEO; and Charlie, our CFO. Full details of our results and additional management commentary are available in our shareholder letter, which can be found on our Investor Relations website at website.com/investor. Our comments and responses to your questions on this call reflect management's views as of today only and we disclaim any obligation to update this information. On this call, we'll make forward-looking statements which are predictions, projections, or other.
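One way to approximate this requirement is to post-process the string returned by extract_text() line by line. The sketch below is a heuristic only, not a pdfplumber feature: the function name keep_body_lines, the set of bullet markers and the max_heading_words threshold are all assumptions chosen for illustration. It drops lines that start with a bullet marker, and drops short lines without closing punctuation when they appear at the start of a block. Wrapped continuation lines of a bullet are not handled.

```python
import re

# Assumed bullet markers: common bullet glyphs, dashes, or "1." / "1)" numbering.
BULLET_RE = re.compile(r"^\s*(?:[\u2022\u25cf\u25aa\u25e6\u2023*\-\u2013]|\d{1,2}[.)])\s+")
# A line closes a sentence if it ends in . ! ? optionally followed by a quote or bracket.
SENTENCE_END_RE = re.compile(r"[.!?][\"'\u2019\u201d)\]]?\s*$")


def keep_body_lines(text, max_heading_words=8):
    """Return text with bullet lines and heading-like lines removed (heuristic)."""
    kept = []
    prev_closed = True  # the start of the text behaves like the start of a block
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if BULLET_RE.match(stripped):
            prev_closed = True  # whatever follows a bullet starts a new block
            continue
        closes = bool(SENTENCE_END_RE.search(stripped))
        short = len(stripped.split()) <= max_heading_words
        if prev_closed and short and not closes:
            # Short, unpunctuated line at the start of a block: treat as a heading.
            continue
        kept.append(stripped)
        prev_closed = closes
    return "\n".join(kept)
```

With the question's code, this could be applied as ecdata += keep_body_lines(ecpagedata) + "\n". The threshold will likely need tuning per document, since a short wrapped line in the middle of a paragraph is kept only because the line before it did not close a sentence.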

I am attaching an image of the PDF file here.

Source file image

The source image file is my own creation and does not directly or indirectly represent any entity, real or fictitious.

This is obviously homework. You need to make a good-faith effort to solve this yourself before coming to us for help. Hint: how can you identify a bullet point? How can you identify a heading? – Tim Roberts, Aug 12 '22 at 05:19
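One possible reading of the hint: headings are often set in a larger or bolder font than the body, and pdfplumber exposes each character's "size" and "fontname". The sketch below works on plain character dicts of that shape, so it can be tried without a PDF. The helper names dominant_font_size and make_body_filter, the 0.5 pt tolerance, and the idea that bold fonts contain "Bold" in their name are assumptions for illustration, and may not hold for a given file.

```python
from collections import Counter


def dominant_font_size(chars, ndigits=1):
    """Return the most common (rounded) font size among non-space characters."""
    sizes = Counter(
        round(c["size"], ndigits) for c in chars if c.get("text", "").strip()
    )
    if not sizes:
        return None
    return sizes.most_common(1)[0][0]


def make_body_filter(body_size, tolerance=0.5, bold_markers=("Bold",)):
    """Build a predicate that keeps only body-sized, non-bold characters."""
    def is_body(obj):
        if obj.get("object_type") != "char":
            return True  # leave lines, rects and other objects untouched
        if abs(obj["size"] - body_size) > tolerance:
            return False  # noticeably larger or smaller than body text
        return not any(m in obj.get("fontname", "") for m in bold_markers)
    return is_body


# Possible use with pdfplumber (not executed here):
#   page = pdf.pages[0]
#   body = page.filter(make_body_filter(dominant_font_size(page.chars)))
#   print(body.extract_text())
```

This targets headings only. Bullet text usually shares the body font, so it would still need a separate check, for example on the leading bullet glyph or on the x0 indentation of the line.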
