We get biweekly software releases from a supplier who provides us with PDF release notes. The notes contain a lot of irrelevant material, and at the moment we have to manually copy and paste the relevant information from them into a Confluence page.

Ideally, I would like to write a Python app that scrapes certain sections out of the PDF. The structure is pretty much as follows (with the bold parts being the ones I want to extract):

  1. Introduction
  2. New Features
    2.1. New Feature 1
    description
    2.2 New Feature 2
    description
    .
    .
    .
    2.x) New Feature X description
  3. Defect fixes
    description
    table with defect descriptions

The rest of the document is irrelevant in this case.

I have managed to import the file and extract all of the text, but I have no idea how to extract only the headings for section 2, and then, for section 3, take only the table and reformat it with pandas. Any suggestions on how to go about this?

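import os
import fitz  # PyMuPDF

# Raw string so that '\r' is not read as a carriage return; expanduser resolves '~'
filename = os.path.expanduser(r'~\releasenotes.pdf')

doc = fitz.open(filename)
print(doc)  # Just to see what comes out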

(And now what should I do next?)

MMG
Isaac

1 Answer

A simple regex (regular expression) should do the trick here. I'm making some big assumptions about what the text looks like when it comes out of your PDF read: I have copied the text from your post and called it "doc", per your question :)

import re  # regular expression library
import pandas as pd  # needed for pd.Series below

doc = '''
Introduction
New Features
2.1. New Feature 1
description
2.2 New Feature 2
description
.
.
.
2.x) New Feature X description
'''

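ds_features = pd.Series(re.findall(r'2\.[1-9].*\n', doc))

Let me unpack that last line: re.findall returns a list of every piece of the document that matches the search pattern. The pattern r'2\.[1-9].*\n' finds every instance of a literal "2." (the dot is escaped, since a bare . matches any character), followed by a digit from 1 to 9, followed by any number of characters (.*) up to and including the line break (\n). Note that the placeholder line "2.x)" in the sample does not match, because x is not a digit.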
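The pattern above can match anywhere in a line, so a "2.5" inside a description or in a later section would also be picked up. As a rough sketch of one way to tighten it, the version below first cuts the text down to the part between the section titles and then only accepts matches at the start of a line. The function name section_headings and the marker strings 'New Features' and 'Defect fixes' are assumptions taken from the outline in the question, not from a real release note, and the sample text is made up for illustration. With a recent PyMuPDF, the input text would presumably come from something like page.get_text() on each page.

```python
import re

# Sample text modelled on the outline in the question; real PDF output may differ.
doc = '''
Introduction
New Features
2.1. New Feature 1
description
2.2 New Feature 2
description
Defect fixes
description
2.5 kg is not a heading
'''

def section_headings(text, start='New Features', end='Defect fixes'):
    """Return the 2.x heading lines found between the start and end markers."""
    begin = text.find(start)
    if begin == -1:
        return []  # start marker not found, nothing to extract
    stop = text.find(end, begin)
    section = text[begin:stop] if stop != -1 else text[begin:]
    # ^ and $ with re.MULTILINE anchor each match to a whole line;
    # \d+ allows multi-digit numbers such as 2.10;
    # [.)]? allows an optional trailing "." or ")" after the number.
    pattern = re.compile(r'^[ \t]*(2\.\d+[.)]?[ \t]+.*)$', re.MULTILINE)
    return [match.strip() for match in pattern.findall(section)]

headings = section_headings(doc)
print(headings)
```

The resulting list can be passed to pd.Series in the same way as the re.findall result above.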

Hope this fits the bill?

DaveB
  • Fantastic, thanks, that pushed me in the right direction. Seeing as it was a PDF, I figured I would convert it to HTML first (it seemed easier) and then, as suggested, regex-search the sections, which worked (after much regex magic). I have then managed to parse the HTML files with BeautifulSoup and save the bits that I want to a dataframe. The challenge now is to get it to read all converted files in the directory; for some reason it only takes the first. Still a good step forward, thanks! – Isaac Sep 01 '20 at 17:30
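
On the last point in the comment (only the first converted file being read), the cause cannot be determined without seeing the code, but a common culprit is a return or break inside the loop. A minimal sketch of looping over every file in a directory is shown below; the function name read_all_html and the '*.html' pattern are assumptions for illustration only.

```python
from pathlib import Path

def read_all_html(directory):
    """Return {file name: file contents} for every .html file in directory."""
    results = {}
    # glob yields every matching path; sorted() gives a stable order.
    for path in sorted(Path(directory).glob('*.html')):
        # Do the per-file work inside the loop body. A return or break
        # placed here would stop after the first file.
        results[path.name] = path.read_text(encoding='utf-8')
    return results
```

Each value in the returned dictionary could then be handed to the parsing step in turn.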