Python scraping an unstructured PDF

Question

We get biweekly software releases from a supplier, who provides PDF release notes with each one. The notes contain a lot of irrelevant material, and at the moment we have to manually copy and paste the relevant information from them into a Confluence page.

Ideally I would like to write a Python app to scrape certain sections out of the PDF. The structure is roughly as follows (with the bold parts being the ones I want to extract):

Introduction
New Features
2.1. New Feature 1
description
2.2 New Feature 2
description
.
.
.
2.x) New Feature X description
Defect fixes
description
table with defect descriptions

rest of the document is irrelevant in this case

I have managed to import the file and extract all of the text, but I have no idea how to extract only the headings for section 2, and then for section 3 take only the table and reformat it with pandas. Any suggestions on how to go about this?

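import os

import fitz  # PyMuPDF

# Raw string so '\r' is not read as a carriage return; expanduser resolves '~'
filename = os.path.expanduser(r'~\releasenotes.pdf')

doc = fitz.open(filename)
print(doc)  # Just to see what comes out

(and now what should I do next?)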

score 0 · Accepted Answer · answered Sep 01 '20 at 12:23

A simple regex (regular expression) should do the trick here. I'm making some big assumptions about what the text looks like once it has been read out of your PDF. I have copied the text from your post and called it "doc", per your question :)

import re  # regular expression library
import pandas as pd  # needed for pd.Series below

doc = '''
Introduction
New Features
2.1. New Feature 1
description
2.2 New Feature 2
description
.
.
.
2.x) New Feature X description
'''

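ds_features = pd.Series(re.findall(r'2\.[1-9].*\n', doc))

Let me unpack that last line: re.findall returns a list of every piece of text in your document that matches the search pattern. The pattern r'2\.[1-9].*\n' matches a literal "2." (the dot is escaped, because an unescaped dot matches any character), followed by a digit from 1 to 9, followed by any number of characters (.*) up to and including the line break (\n). Note that the placeholder line "2.x)" is not matched, because "x" is not a digit, and that each match keeps its trailing line break.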
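As a variation on the findall approach, here is a minimal sketch that anchors the pattern to the start of each line and captures the number and the title separately. It assumes the extracted text keeps one heading per line and that headings look like "2.1.", "2.2" or "2.3)" followed by the title; the names sample_text, HEADING_RE and extract_feature_headings are made up for illustration, and real PDF output may need a looser pattern.

```python
import re

# Stand-in for the text extracted from the PDF (an assumption about its layout).
sample_text = """Introduction
New Features
2.1. New Feature 1
description
2.2 New Feature 2
description
2.3) New Feature 3 description
Defect fixes
description
"""

# A heading is a line that starts with "2.<digits>", optionally followed by
# "." or ")", then spaces or tabs, then the title up to the end of the line.
HEADING_RE = re.compile(r"^2\.(\d+)[.)]?[ \t]+(.*\S)[ \t]*$", re.MULTILINE)


def extract_feature_headings(text):
    """Return (section number, title) pairs for every section-2 heading."""
    return [(f"2.{num}", title) for num, title in HEADING_RE.findall(text)]


headings = extract_feature_headings(sample_text)
for number, title in headings:
    print(number, title)
```

Because the pattern is anchored with ^ and re.MULTILINE, a "2.5" that appears in the middle of a description line is ignored. The resulting list of pairs can be passed straight to pd.DataFrame(headings, columns=["section", "title"]) if a table is wanted.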

Hope this fits the bill?

Fantastic, thanks, that pushed me in the right direction. Seeing as it was a PDF, I decided to convert it into HTML first (it seemed easier) and then, as suggested, regex search the sections, which worked (after much regex magic). I have then managed to parse the HTML files with BeautifulSoup and save the bits that I want to a dataframe. The challenge now is to get it to read all converted files in the directory; for some reason it only takes the first. Still a good step forward, thanks!! - Isaac, Sep 01 '20 at 17:30
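The code behind the "only takes the first file" problem is not shown, so the cause is unknown. One common pattern that avoids it is to loop over every matching file and collect the results inside the loop, as in this rough sketch. It assumes each converted file can be read as plain text with one heading per line; the function name headings_per_file, the "*.txt" pattern and the demo file names are invented for illustration, and the demo writes throwaway files to a temporary directory rather than touching real release notes.

```python
import re
import tempfile
from pathlib import Path

# Lines that start with "2.<digits>" are treated as section-2 headings.
HEADING_RE = re.compile(r"^2\.\d+.*$", re.MULTILINE)


def headings_per_file(directory, pattern="*.txt"):
    """Map each matching file name to the section-2 headings found in it."""
    results = {}
    # glob yields every match; sorting makes the order predictable.
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        # Store inside the loop so later files do not overwrite earlier ones.
        results[path.name] = HEADING_RE.findall(text)
    return results


# Demo with two throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "release_a.txt").write_text(
        "Intro\n2.1 Feature A\n", encoding="utf-8")
    Path(tmp, "release_b.txt").write_text(
        "Intro\n2.1 Feature B\n2.2 Feature C\n", encoding="utf-8")
    found = headings_per_file(tmp)

print(found)
```

For HTML files, the pattern argument would become "*.html" and the read_text line would be replaced by whatever parsing step already works for a single file.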
