I am currently trying to extract the Combined Management Report from the following annual report: https://www.merckgroup.com/content/dam/web/corporate/non-images/investors/reports-and-financials/earnings-materials/2021-q4/en/2021-Q4-Report-EN.pdf
I converted the PDF to text: https://www.file-upload.net/download-15044794/MerckKGaA03-MAR-2022FullYear66228323.txt.html
Now I want to extract pages 14-228 (the whole chapter Management Report and Corporate Governance).
Currently I work with two regular expressions: the first marks the start position and the second the end position.
import re
item7_regex = re.compile(r"Fundamental Information about", re.DOTALL)
item8_regex = re.compile(r"Consolidated\s+Financial\s*Statements", re.DOTALL)
And to extract the section I use:
def extract_mdna(plain_text: str, min_mdna_length=25000):
    section_start_match = item7_regex.search(plain_text)
    if section_start_match:
        section_start_pos = section_start_match.start()
        section_end_pos = section_start_pos
        while (section_end_pos < section_start_pos + min_mdna_length) and section_end_pos != -1:
            section_end_match = item8_regex.search(plain_text, section_end_pos + 1)
            if section_end_match:
                section_end_pos = section_end_match.start()
            else:
                section_end_pos = -1
        if section_end_pos > 0:
            item7_text = plain_text[section_start_pos:section_end_pos]
            return item7_text
    return None
for x in fileList:
    pepsi = open(r"C:\Users\hp\Desktop\Research\test\Merck KGaA 03-MAR-2022 Full Year 66228323.txt", 'r', encoding='utf-8').read()
    nur = extract_mdna(pepsi)
    print(nur)
This returns the table of contents instead of the actual chapter.
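A possible workaround, sketched under the assumption that the start heading appears once in the table of contents and once more at the beginning of the chapter: skip the first match and cut from the next one. The function name extract_after_toc and the skip_matches parameter are made up for illustration, and the end pattern may still hit an earlier cross-reference in the running text.

```python
import re

start_regex = re.compile(r"Fundamental Information about")
end_regex = re.compile(r"Consolidated\s+Financial\s*Statements")

def extract_after_toc(plain_text, skip_matches=1):
    """Return the text between a later start heading and the next end heading.

    skip_matches is the number of leading start matches to ignore
    (assumption: the first match belongs to the table of contents).
    """
    starts = [m.start() for m in start_regex.finditer(plain_text)]
    if len(starts) <= skip_matches:
        return None  # heading does not occur often enough
    start_pos = starts[skip_matches]
    end_match = end_regex.search(plain_text, start_pos + 1)
    if end_match is None:
        return None  # no end heading after the chosen start
    return plain_text[start_pos:end_match.start()]
```

If the heading also shows up in page headers or cross-references, skip_matches would have to be adjusted per report.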
I then found the following thread: "Extract specific pages of PDF and save it with Python"
It uses this variable:
information = [(filename1,startpage1,endpage1), (filename2, startpage2, endpage2), ...,(filename19,startpage19,endpage19)]
import os
from PyPDF2 import PdfFileReader, PdfFileWriter  # legacy PyPDF2 (<3.0) API

reader = PdfFileReader("example.pdf")
savepath = "output"  # assumed output folder; not defined in the original snippet
if not os.path.exists(savepath):
    os.makedirs(savepath)

for page in range(len(information)):
    pdf_writer = PdfFileWriter()
    start = information[page][1]
    end = information[page][2]
    # page numbers in information are 1-based, getPage() is 0-based
    while start <= end:
        pdf_writer.addPage(reader.getPage(start - 1))
        start += 1
    output_filename = '{}_{}_page_{}.pdf'.format(information[page][0], information[page][1], information[page][2])
    with open(os.path.join(savepath, output_filename), 'wb') as out:
        pdf_writer.write(out)
This works fine. My question is: is there a way to build the information variable automatically from my files?
Coming back to my example, this is the table of contents as it appears in the extracted text:
Fundamental Information about the Group
with respect to its Composition and Profile of Skills and Expertise
14
Merck
24
Strategy
31
Internal Management System
229 Consolidated Income Statement
38
Research and Development
230 Consolidated Statement of Comprehensive Income
51
Report on Economic Position
231 Consolidated Balance Sheet
51
232 Consolidated Cash Flow Statement
Macroeconomic and Sector-Specific Environment
55
Review of Forecast against Actual Business Development
87
63
Course of Business and Economic Position
63
Merck Group
73
Life Science
77
Healthcare
82
Electronics
86
Corporate and Other
Report on Risks and Opportunities
104 Report on Expected Developments 108 Report in accordance with section 315a of the German Commercial Code (HGB)
233 Consolidated Statement of Changes in Net Equity 234 Notes to the
Is there a way to automatically extract the page numbers from the TOC, i.e. the number to the left of each chapter name, and put them into the information variable?
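To make the goal concrete, here is a rough sketch of what I have in mind. It assumes that a page number either stands alone on the line before its title or sits in front of the title on the same line, as in the dump above. The names parse_toc, first_page_of and build_information are hypothetical helpers, not part of any library. Lines with two entries (such as the "104 ... 108 ..." line) and two numbers in a row are not handled correctly, and printed page numbers may not match the PDF's physical page indices.

```python
import re

NUMBER_ONLY = re.compile(r"^\d{1,3}$")
NUMBER_THEN_TITLE = re.compile(r"^(\d{1,3})\s+(\D.*)$")

def parse_toc(toc_text):
    """Collect (page, title) pairs from a raw TOC text dump."""
    entries = []
    pending_page = None  # number seen on its own line, waiting for a title
    for raw_line in toc_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if NUMBER_ONLY.match(line):
            pending_page = int(line)
            continue
        same_line = NUMBER_THEN_TITLE.match(line)
        if same_line:
            # "229 Consolidated Income Statement": number and title together
            entries.append((int(same_line.group(1)), same_line.group(2).strip()))
        elif pending_page is not None:
            # plain title line that follows a lone number
            entries.append((pending_page, line))
            pending_page = None
    return entries

def first_page_of(entries, title_part):
    """Page of the first entry whose title contains title_part (case-insensitive)."""
    for page, title in entries:
        if title_part.lower() in title.lower():
            return page
    return None

def build_information(filename, toc_text, start_title, end_title):
    """Build [(filename, startpage, endpage)]; endpage is the page before end_title."""
    entries = parse_toc(toc_text)
    start = first_page_of(entries, start_title)
    next_start = first_page_of(entries, end_title)
    if start is None or next_start is None or next_start <= start:
        return []
    return [(filename, start, next_start - 1)]
```

With the TOC above, build_information(filename, toc_text, "Merck", "Consolidated Income Statement") would be meant to give a single tuple covering pages 14 to 228.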