Excluding the Header and Footer Contents of a page of a PDF file while extracting text?

Question

Is it possible to exclude the contents of a page's headers and footers when extracting text from a PDF file? This content is of little importance and is almost entirely redundant.

Note: To extract the text from the .pdf file, I am using the PyPDF2 package on Python 3.7.

How can I exclude the contents of the headers and footers in PyPDF2? Any help is appreciated.

The code snippet is as follows:

import PyPDF2

def Read(startPage, endPage):
    global text
    text = []
    cleanText = " "
    pdfFileObj = open('C:\\Users\\Rocky\\Desktop\\req\\req\\0000 - gamma j.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    print(num_pages)
    while (startPage <= endPage):
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.strip().split()
    print(text)

Read(1, 1)

score 4 · Answer 1 · answered Jul 31 '22 at 10:31

PyPDF2 does not officially provide a feature for this, so I've written my own function to exclude the headers and footers of a PDF page; it works fine for my use case. You can add your own regex patterns to the page_format_pattern variable. Here I check only the first and last elements of my list of text lines. You can run this function on each page.

import re

def remove_header_footer(pdf_extracted_text):
    # Matches page markers such as "page 3" or "page12"; add your own patterns here.
    page_format_pattern = r'page\s*\d+'
    lines = pdf_extracted_text.split("\n")
    # Lowercase only the copies used for matching, so the returned text keeps its case.
    header = lines[0].strip().lower()
    footer = lines[-1].strip().lower()
    # Drop the first line if it looks like a page marker or a bare page number.
    if re.search(page_format_pattern, header) or header.isnumeric():
        lines = lines[1:]
    # Apply the same check to the last line.
    if re.search(page_format_pattern, footer) or footer.isnumeric():
        lines = lines[:-1]
    return "\n".join(lines)

Hope you find this helpful.

Martin Thoma · Answer 2 · 2023-07-07T06:19:01.040

At the moment, pypdf (and the deprecated PyPDF2) does not offer this. It's also unclear how to do it well, as headers and footers are not semantically represented within the PDF.

As a heuristic, you could search for duplicates at the top and bottom of each page's extracted text. That would likely work well for long documents and not at all for one-page documents.

You need to consider that the first few pages might have no header, or a different header than the rest. Headers can also differ between chapters and between even and odd pages.
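
A minimal sketch of the duplicate-based heuristic described above, not a pypdf feature. It takes the already extracted text of each page and drops lines near the top or bottom that recur on a large share of pages. The function name strip_repeated_edges and the parameters edge_lines and min_share are hypothetical, and their default values are guesses to tune per document. Digits are masked before comparing so that running page numbers still count as the same line, and the share threshold leaves room for title pages or chapters with a different header.

```python
import re
from collections import Counter


def strip_repeated_edges(pages, edge_lines=2, min_share=0.6):
    """Drop lines near the top/bottom of pages that repeat across pages.

    pages: list of per-page text strings (e.g. one extract_text() result each).
    edge_lines: how many lines at each edge are candidates (assumed default).
    min_share: fraction of pages a line must appear on (assumed default).
    """
    if len(pages) < 2:
        # A single page offers nothing to compare against.
        return list(pages)

    split_pages = [page.split("\n") for page in pages]

    def normalize(line):
        # Mask digits so "Page 3" and "Page 4" compare as equal.
        return re.sub(r"\d+", "#", line.strip().lower())

    counts = Counter()
    for lines in split_pages:
        edges = lines[:edge_lines] + lines[-edge_lines:]
        # A set ensures each page counts a given line at most once.
        counts.update({normalize(line) for line in edges if line.strip()})

    threshold = min_share * len(pages)
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge_lines or i >= len(lines) - edge_lines
            if at_edge and normalize(line) in repeated:
                continue
            kept.append(line)
        cleaned.append("\n".join(kept))
    return cleaned
```

Lines in the middle of a page are never removed, even if they match a repeated edge line, so body text that happens to quote the header is left alone.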

Side note: I'm the maintainer of pypdf and PyPDF2 and I think this will never be inside pypdf. The reason is that it cannot be done reliably. You need some context knowledge. That makes it a good fit for machine learning, but not such a good fit for a library. People would not be happy if it worked just 80% of the time, and we would constantly have to extend it.

Ideas on how to identify the footer

Go by the position. Just define a threshold under which you assume the footer is. Then you can use visitor functions: https://pypdf2.readthedocs.io/en/3.0.0/user/extract-text.html#using-a-visitor
Try to find text patterns which are on every page at the bottom.
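
The position-based idea can be sketched with a visitor function roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the helper names make_body_visitor and extract_body_text are hypothetical, the cut-offs 50 and 720 are placeholder values in PDF units that need tuning per document, and the sketch assumes the vertical position of a text fragment can be read from tm[5]. Depending on the PDF and the library version, the position may sit in cm instead, so check the linked documentation and a sample page first.

```python
def make_body_visitor(parts, footer_y=50, header_y=720):
    """Build a visitor that keeps text lying between two y cut-offs.

    footer_y and header_y are assumed values in PDF user-space units.
    """
    def visitor_body(text, cm, tm, font_dict, font_size):
        # Assumption: the last entry of the text matrix is the y position.
        y = tm[5]
        if footer_y < y < header_y:
            parts.append(text)
    return visitor_body


def extract_body_text(path, footer_y=50, header_y=720):
    # Imported here so the visitor above can be tried without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    pages_text = []
    for page in reader.pages:
        parts = []
        visitor = make_body_visitor(parts, footer_y, header_y)
        # extract_text calls the visitor once per text fragment on the page.
        page.extract_text(visitor_text=visitor)
        pages_text.append("".join(parts))
    return pages_text
```

A fresh parts list is created for every page, so the result is one body-only string per page.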

Is there any already-developed open-source solution that you can share, please? — famas23, Jul 05 '23 at 11:00
No, there is none. But it's also not super hard to implement it yourself. — Martin Thoma, Jul 05 '23 at 11:02
What about the visitor-functions in pypdf's documentation? Always thought that part is a bit hard to read/implement — Dinoman, Jul 07 '23 at 03:58
