Build a dynamic index and then extract specific pages of PDF and save it with Python

Question

I am trying to extract the Combined Management Report from the following annual report: https://www.merckgroup.com/content/dam/web/corporate/non-images/investors/reports-and-financials/earnings-materials/2021-q4/en/2021-Q4-Report-EN.pdf

I converted the PDF to text: https://www.file-upload.net/download-15044794/MerckKGaA03-MAR-2022FullYear66228323.txt.html

Now I want to extract pages 14-228 (the whole chapter "Management Report and Corporate Governance").

Currently I use two regular expressions: the first marks the start position and the second marks the end position.

import re

item7_regex = re.compile(r"Fundamental Information about", re.DOTALL)
item8_regex = re.compile(r"Consolidated\s+Financial\s*Statements", re.DOTALL)

And to extract the text I use:

def extract_mdna(plain_text: str, min_mdna_length=25000):
    # Find the first occurrence of the start heading.
    section_start_match = item7_regex.search(plain_text)
    if section_start_match:
        section_start_pos = section_start_match.start()
        section_end_pos = section_start_pos
        while (section_end_pos < section_start_pos + min_mdna_length) and section_end_pos != -1:
            # Look for the next occurrence of the end heading.
            section_end_match = item8_regex.search(plain_text, section_end_pos + 1)
            if section_end_match:
                section_end_pos = section_end_match.start()
            else:
                section_end_pos = -1
            # Return everything between the start heading and the end heading.
            if section_end_pos > 0:
                item7_text = plain_text[section_start_pos:section_end_pos]
                return item7_text
    return None

# Read the converted report and print the extracted section.
with open(r"C:\Users\hp\Desktop\Research\test\Merck KGaA 03-MAR-2022 Full Year 66228323.txt", 'r', encoding='utf-8') as f:
    pepsi = f.read()
nur = extract_mdna(pepsi)
print(nur)

It gives me the table of contents, not the actual chapter.
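One way to check what is happening, as a rough sketch: list every position where the start pattern matches. Since `search()` returns the earliest match, a heading that appears both in the table of contents and at the start of the chapter will always be found in the table of contents first. `match_positions` and the `sample` text are made up for illustration; only `item7_regex` comes from my code above.

```python
import re

item7_regex = re.compile(r"Fundamental Information about")

def match_positions(pattern, text):
    """Return the start offset of every non-overlapping match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]

# Made-up miniature document: the heading appears once in the
# table of contents and once more where the chapter really begins.
sample = (
    "Contents\n"
    "Fundamental Information about the Group 14\n"
    "Some other entry 24\n"
    "\n"
    "Fundamental Information about the Group\n"
    "Chapter body text.\n"
)

positions = match_positions(item7_regex, sample)
print(positions)  # one offset per occurrence of the heading
```

If more than one offset is printed for the real file, the first one is probably the table of contents entry.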

Now I found the following thread: Extract specific pages of PDF and save it with Python

It uses this variable:

information = [(filename1,startpage1,endpage1), (filename2, startpage2, endpage2), ...,(filename19,startpage19,endpage19)]

import os

from PyPDF2 import PdfFileReader, PdfFileWriter  # legacy PyPDF2 class names

savepath = "output"  # assumed output folder; not defined in the original snippet

reader = PdfFileReader("example.pdf")

if not os.path.exists(savepath):
    os.makedirs(savepath)

for page in range(len(information)):
    pdf_writer = PdfFileWriter()
    start = information[page][1]
    end = information[page][2]
    # Page numbers in information are 1-based, getPage() is 0-based.
    while start <= end:
        pdf_writer.addPage(reader.getPage(start - 1))
        start += 1
    output_filename = '{}_{}_page_{}.pdf'.format(information[page][0], information[page][1], information[page][2])
    with open(os.path.join(savepath, output_filename), 'wb') as out:
        pdf_writer.write(out)

It works fine. My question is: is there a way to build the information variable automatically from my files?

Coming back to my example, I have the table of contents:

Fundamental Information about the Group

with respect to its Composition and Profile of Skills and Expertise

14

Merck

24

Strategy

31

Internal Management System

229 Consolidated Income Statement

38

Research and Development

230 Consolidated Statement of Comprehensive Income

51

Report on Economic Position

231 Consolidated Balance Sheet

51

232 Consolidated Cash Flow Statement

Macroeconomic and Sector-Specific Environment

55

Review of Forecast against Actual Business Development

87

63

Course of Business and Economic Position

63

Merck Group

73

Life Science

77

Healthcare

82

Electronics

86

Corporate and Other

Report on Risks and Opportunities

104 Report on Expected Developments 108 Report in accordance with section 315a of the German Commercial Code (HGB)

233 Consolidated Statement of Changes in Net Equity 234 Notes to the

Is there a way to automatically extract the page number from the TOC, i.e. the number to the left of the chapter name, and parse it automatically into the information variable?

Hey K J, I guess you are right. Still, I tried to the bitter end, and I still think there might be a solution to my problem. As you can see at the top, my regex does extract the table of contents. I am thinking of something like: search for the first number that follows the search pattern. In my case that is "Fundamental Information about the Group with respect to its Composition and Profile of Skills and Expertise 14", i.e. page 14. Doing the same for the "end regex" would give the starting and ending pages of my report. — Limps, Nov 24 '22 at 08:43
If it is not perfectly accurate, because page 14 in the TOC is actually page 17 in the document, that would still be fine. Do you think something like this would be suitable? — Limps, Nov 24 '22 at 08:43
To be honest, when I started this task I did not anticipate that extracting text from a PDF would be so difficult. — Limps, Nov 24 '22 at 12:59
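A minimal sketch of the idea from my comments above, under several assumptions: the page number is the first run of digits after the heading in the TOC text, the chapter ends on the page before the next chapter starts, and there is a fixed offset between TOC page numbers and PDF page numbers (3 in my case, since TOC page 14 is PDF page 17). `first_page_after`, `build_entry` and the miniature `toc` string are hypothetical names and data, not part of any library. With my real TOC the columns are interleaved, so the first number after a heading may belong to a different entry.

```python
import re

def first_page_after(heading_pattern, toc_text):
    """Return the first integer that follows heading_pattern in toc_text, or None."""
    heading = re.search(heading_pattern, toc_text)
    if heading is None:
        return None
    number = re.search(r"\d+", toc_text[heading.end():])
    if number is None:
        return None
    return int(number.group())

def build_entry(filename, toc_text, start_pattern, end_pattern, offset=0):
    """Build one (filename, startpage, endpage) tuple from the TOC text.

    Assumption: the section ends one page before the next chapter
    (matched by end_pattern) begins. offset shifts TOC page numbers
    to PDF page numbers.
    """
    start = first_page_after(start_pattern, toc_text)
    next_start = first_page_after(end_pattern, toc_text)
    if start is None or next_start is None:
        return None
    return (filename, start + offset, next_start - 1 + offset)

# Made-up, cleaned-up TOC fragment for illustration only.
toc = (
    "Fundamental Information about the Group\n\n14\n\n"
    "Merck\n\n24\n\n"
    "Consolidated Financial Statements\n\n229\n"
)

entry = build_entry(
    "example.pdf",
    toc,
    r"Fundamental Information about",
    r"Consolidated\s+Financial\s*Statements",
    offset=3,
)
information = [entry] if entry is not None else []
print(information)
```

The resulting list has the same shape as the information variable in the PyPDF2 loop, so it could be passed to that loop directly if the assumptions hold for a given report.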
