We get bi weekly software releases from a supplier who provides us with PDF release notes. The notes have got a lot of irrelevant stuff in them, but ultimately we need to go and manually copy/paste information from these notes into a Confluence page.
Ideally I would like to be able to write a python app to be able to scrape certain sections out of the PDF. The structure is pretty much as follows (with the bold parts being the ones I want to extract):
- Introduction
- New Features
2.1. New Feature 1
description
2.2 New Feature 2
description
.
.
.
2.x) New Feature X description - Defect fixes
description
table with defect descriptions
rest of the document is irrelevant in this case
I have managed to get it to import the file and extract (all) of the text, but I have really got no idea how to extract only the headings for section 2, and then for section 3 only take the table and reformat it with pandas. Any suggestions on how to go about this ?
import fitz
filename = '~\releasenotes.pdf'
doc = fitz.open(filename)
print (doc) # Just to see what comes out
(and now what should I do next ?)