Is it possible to exclude the headers and footers of a page when extracting text from a PDF file? This content is the least important and is almost entirely redundant.

Note: To extract the text from the .pdf file, I am using the PyPDF2 package on Python 3.7.

How can I exclude the contents of the headers and footers in PyPDF2? Any help is appreciated.

The code snippet is as follows:

import PyPDF2

def Read(startPage, endPage):
    # Page indices are zero-based in PyPDF2, so Read(1, 1) reads the second page.
    global text
    pageTexts = []
    with open('C:\\Users\\Rocky\\Desktop\\req\\req\\0000 - gamma j.pdf', 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        num_pages = pdfReader.numPages
        print(num_pages)
        while startPage <= endPage:
            pageObj = pdfReader.getPage(startPage)
            pageTexts.append(pageObj.extractText())
            startPage += 1
    # Drop newline characters, then split the remaining text into words.
    cleanText = "".join(pageTexts).replace('\n', '')
    text = cleanText.strip().split()
    print(text)

Read(1, 1)
M S

2 Answers

4

PyPDF2 has no official feature for this, so I wrote my own function to strip headers and footers from a PDF page; it works fine for my use case. You can add your own regex patterns to the page_format_pattern variable. The function checks only the first and last elements of the list of lines from the extracted text. You can run it for each page.

import re

def remove_header_footer(pdf_extracted_text):
    # Matches labels such as "page 3" or "Page12", ignoring case.
    page_format_pattern = r'page\s*\d+'
    lines = pdf_extracted_text.split("\n")
    header = lines[0].strip()
    footer = lines[-1].strip()
    # Drop the first line if it looks like a page label or a bare number.
    if re.search(page_format_pattern, header, re.IGNORECASE) or header.isnumeric():
        lines = lines[1:]
    # Drop the last line under the same conditions.
    if re.search(page_format_pattern, footer, re.IGNORECASE) or footer.isnumeric():
        lines = lines[:-1]
    return "\n".join(lines)

Hope you find this helpful.
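To illustrate adding more patterns and running the check on every page, here is a minimal sketch. The names EDGE_PATTERNS, is_header_or_footer and strip_page_edges are my own for this example, and the patterns are only guesses at what a header or footer might look like; adjust them to your document.

```python
import re

# Hypothetical patterns for lines that look like a header or footer.
# Append your own compiled patterns to this list.
EDGE_PATTERNS = [
    re.compile(r'^page\s*\d+(\s*of\s*\d+)?$', re.IGNORECASE),  # "Page 3", "page 3 of 10"
    re.compile(r'^\d+$'),                                       # a bare page number
]

def is_header_or_footer(line):
    """Return True if the stripped line matches any edge pattern."""
    stripped = line.strip()
    return any(p.search(stripped) for p in EDGE_PATTERNS)

def strip_page_edges(page_text):
    """Drop the first and/or last line of a page if it matches a pattern."""
    lines = page_text.split("\n")
    if lines and is_header_or_footer(lines[0]):
        lines = lines[1:]
    if lines and is_header_or_footer(lines[-1]):
        lines = lines[:-1]
    return "\n".join(lines)

# Usage: pages is assumed to be a list with one extracted string per page.
pages = ["Page 1\nFirst body line\n1", "Page 2\nSecond body line\n2"]
cleaned = [strip_page_edges(p) for p in pages]
```

Anchoring the patterns with ^ and $ keeps a body line that merely mentions a page number from being treated as a header.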

2

At the moment, pypdf (and the deprecated PyPDF2) do not offer this. It's also unclear how to do it well, as headers and footers are not semantically represented within the PDF.

As a heuristic, you could search for duplicates in the top / bottom of the extracted text of pages. That would likely work well for long documents and not work at all for 1-page documents.

You need to consider that the first few pages might have no header or a different header than the rest. Also, there can be differences between chapters and between even / odd pages.
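A rough sketch of that duplicate heuristic, working on plain strings (one per page) rather than on any pypdf API. The function names, the edge_lines window and the min_share threshold are assumptions made for illustration, and replacing digits with "#" is an extra step of mine so that "Page 1" and "Page 2" count as the same line.

```python
import re
from collections import Counter

def normalize(line):
    """Strip whitespace and mask digits so page numbers compare equal."""
    return re.sub(r'\d+', '#', line.strip())

def find_repeated_edge_lines(pages, edge_lines=2, min_share=0.6):
    """Return normalized lines that recur near the top or bottom of many pages."""
    counts = Counter()
    for page_text in pages:
        lines = [normalize(ln) for ln in page_text.split("\n")]
        # A set counts each candidate once per page; blank lines are ignored.
        edges = set(lines[:edge_lines] + lines[-edge_lines:]) - {""}
        counts.update(edges)
    # Require at least two pages, so a 1-page document is left alone.
    threshold = max(2, min_share * len(pages))
    return {line for line, n in counts.items() if n >= threshold}

def strip_repeated_edge_lines(pages, edge_lines=2, min_share=0.6):
    """Remove repeated lines, but only within the top/bottom window of each page."""
    repeated = find_repeated_edge_lines(pages, edge_lines, min_share)
    cleaned = []
    for page_text in pages:
        lines = page_text.split("\n")
        last = len(lines) - 1
        kept = []
        for i, ln in enumerate(lines):
            near_edge = i < edge_lines or i > last - edge_lines
            if near_edge and normalize(ln) in repeated:
                continue
            kept.append(ln)
        cleaned.append("\n".join(kept))
    return cleaned
```

With min_share below 1.0, a title page without a header does not stop the header from being detected on the other pages. Headers that alternate between even and odd pages, or change per chapter, would need a lower threshold or separate passes, and this sketch does not attempt either.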

Side note: I'm the maintainer of pypdf and PyPDF2, and I think this will never be part of pypdf. The reason is that it cannot be done reliably; you need some context knowledge. That makes it a good fit for machine learning, but not such a good fit for a library. People would not be happy if it worked just 80% of the time, and we would constantly have to extend it.

Ideas on how to identify the footer

Martin Thoma