Remove header and footer from pdftotext module in Python

Question

I am using the pdftotext Python package to extract text from a PDF, but I need to remove the headers and footers from the text file so that only the body content remains.

I can see two ways to solve this:

Using regular expressions in text file
Using some filter while getting text from pdf

The problem is that the headers and footers are not consistent across pages.

For example, the first one or two lines of the header might contain the contractor's address, which is consistent, but the third line of the header contains the section and the topic that the page covers. Similarly, the footer consists of a project number (not a fixed value either), a subsection number, and some design words followed by a date, which should be consistent within a project (but different for every project). Note also that the PDF file can be 500+ pages for every project, although it will probably be split by section.

Currently I'm using this code to extract information. Are there any parameters I don't know about which can be used to remove headers and footers?

import pdftotext

def get_data(pdf_path):
    # Parse the PDF; each element of `pdf` is the text of one page.
    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)

    print("Pages:", len(pdf))

    # The with-blocks close both files, so no explicit close() is needed.
    with open('text-pdftotext.txt', 'w') as k:
        k.write("\n\n".join(pdf))

get_data('specification_file.pdf')

score 1 · Answer 1 · answered May 07 '22 at 02:52

pdftotext is best used as designed, i.e. as a command-line tool run from any shell.

So to remove page breaks, headers and footers, use the command exactly as it was designed to be run.

pdftotext -nopgbrk -margint <number> -marginb <number> filename

With xpdf 4.04 that will give you the body text without the top lines and without the bottom lines.

If you are using the Poppler variant, you need to set a region of interest with:

  -x <int>             : x-coordinate of the crop area top left corner
  -y <int>             : y-coordinate of the crop area top left corner
  -W <int>             : width of crop area in pixels (default is 0)
  -H <int>             : height of crop area in pixels (default is 0)
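
As a rough sketch of driving the Poppler variant from Python: the helper below only builds the argument list for a crop area that skips a band at the top and bottom of every page. The function names, the output file name, the page size (612 x 792, i.e. US Letter at Poppler's default 72 DPI) and the 72-pixel margins are assumptions for illustration; measure your own header and footer heights before using it.

```python
import subprocess

def poppler_crop_args(pdf_path, out_path, page_w, page_h, top, bottom):
    """Build a Poppler pdftotext command that keeps only the page body.

    page_w/page_h are the page size and top/bottom the header/footer
    heights, all in pixels at the default resolution (assumed values).
    """
    body_h = page_h - top - bottom
    if body_h <= 0:
        raise ValueError("margins leave no body area")
    return [
        "pdftotext", "-nopgbrk",
        "-x", "0", "-y", str(top),          # top-left corner of the crop area
        "-W", str(page_w), "-H", str(body_h),  # crop width and height
        pdf_path, out_path,
    ]

def extract_body(pdf_path, out_path, **kw):
    # Runs the command; requires Poppler's pdftotext on PATH.
    subprocess.run(poppler_crop_args(pdf_path, out_path, **kw), check=True)

# Hypothetical numbers: US Letter page with one-inch header and footer bands.
args = poppler_crop_args("specification_file.pdf", "body.txt",
                         page_w=612, page_h=792, top=72, bottom=72)
print(" ".join(args))
```

Calling extract_body with the same keyword arguments would run the command. The crop area is the same for every page, so this only works if the header and footer bands have a fixed height throughout the document.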

EBo · Answer 2 · 2022-05-07T00:48:18.567

I had the same issue when converting automatically generated project planning PDFs: I wanted to strip the page breaks from the text before emailing the results.

What I did was use regular expressions to match all the numbered page breaks and write out the non-matching portions of the input. Here is the complete code for a little utility script I threw together in 10 minutes:

#!/usr/bin/env python

import sys
import re
import argparse

parser = argparse.ArgumentParser()

parser.add_argument("--infile", "-i", type=str, default=None,
                    help="input file (default: %(default)s).")
parser.add_argument("--outfile", "-o", type=str, default=None,
                    help="output file (default: %(default)s).")

parser.add_argument("--fmt", "-f", type=str, default=r"\d\n\n",
                    help="the footer search format (default: %(default)s).")

args = parser.parse_args()

try:
    # open an input file (use STDIN as default)
    fin = sys.stdin
    if args.infile:
        fin = open(args.infile,'r')

    # read in the entire file in one gulp, and close it.
    fstr = fin.read()
    fin.close()

    # open up the output file (use STDOUT as default)
    fout = sys.stdout
    if args.outfile:
        fout = open(args.outfile,'w')

    # spin through all the matches and copy the text between them
    last = 0
    for match in re.finditer(args.fmt, fstr, re.DOTALL):
        start,end = match.span()

        # write out everything before the matched string since last match.
        fout.write(fstr[last:start])
        last = end

    # write out remaining text at the end of the file and close.
    fout.write(fstr[last:])
    fout.close()

# simple exception handling for file not found, etc.
except Exception as er:
    print(er)

I am sure that others can suggest things to clean up here, introspection documentation, and more, but it works for me as needed.

Please note that this script reads the input text file as a single string for simplicity. This is probably not appropriate for a 500-page file. You can rewrite the reader to work on blocks, but then you have to make sure that no page break falls on a block boundary. Other than that, the code provided should get you close.
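
One way to avoid the block-boundary problem is to work page by page. The sketch below is not part of the script above; it assumes the header and footer each occupy a fixed number of non-blank lines (three and one here, which are guesses) and that pages are separated by a form feed, as in pdftotext output. The function names and the sample text are made up for illustration.

```python
def strip_page_edges(page, header_lines=3, footer_lines=1):
    """Drop the first `header_lines` and last `footer_lines` non-blank lines."""
    lines = page.splitlines()
    # Indices of lines that carry text; blank lines are not counted.
    filled = [i for i, line in enumerate(lines) if line.strip()]
    if len(filled) <= header_lines + footer_lines:
        return ""  # nothing but header/footer on this page
    first = filled[header_lines]
    last = filled[len(filled) - footer_lines - 1]
    return "\n".join(lines[first:last + 1])

def strip_document(text, **kw):
    # pdftotext separates pages with a form feed ("\f").
    pages = [p for p in text.split("\f") if p.strip()]
    return "\n\n".join(strip_page_edges(p, **kw) for p in pages)

# Made-up two-page sample: the third header line changes from page to page.
sample = (
    "ACME Ltd\n1 Main St\nSection 2 - Concrete\n\n"
    "Body A\nmore A\n\nP-101  2.1  2022-05-07\n"
    "\f"
    "ACME Ltd\n1 Main St\nSection 3 - Steel\n\n"
    "Body B\n\nP-101  3.1  2022-05-07\n"
)
print(strip_document(sample))
```

Because lines are counted rather than matched, the changing section line in the header does not matter. With the pdftotext Python package, the same helper could be applied to each page string in the PDF object instead of splitting on form feeds.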

score -1 · Answer 3 · answered May 13 '21 at 09:16

One answer to your problem would be to treat the PDFs as images with the pdf2image module and extract the text from them using pytesseract. This way you would be able to crop the header and the footer with OpenCV and keep only the core of your file. However, it might not be a perfect method, as pdf2image's convert_from_path function can take quite a long time to run.

I've included some code below in case you are interested.

First of all, make sure you install all the necessary dependencies, as well as Tesseract and ImageMagick. You can find installation information on the website. If you are working on Windows, there's a good Medium article here.

To convert PDFs to images using pdf2image:

Don't forget to add your Poppler path if you are working on Windows. It should look something like r'C:\<your_path>\poppler-21.02.0\Library\bin'

import os
from pdf2image import convert_from_path

def pdftoimg(fic, output_folder, poppler_path):
    # Store all the pages of the PDF in a variable
    pages = convert_from_path(fic, dpi=500, output_folder=output_folder, thread_count=9, poppler_path=poppler_path)

    # Iterate through all the pages stored above and save each one as a JPEG
    for image_counter, page in enumerate(pages):
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(os.path.join(output_folder, filename), 'JPEG')

    # Remove the intermediate .ppm files that pdf2image wrote to output_folder
    for i in os.listdir(output_folder):
        if i.endswith('.ppm'):
            os.remove(os.path.join(output_folder, i))

Crop the image footer and header:

I do not know the size of your footers and headers, but by cropping your image a few times you should be able to find the right dimensions to use. You can then crop the image to keep only the body of your PDF using OpenCV array slicing, with new_head being the y coordinate of the first pixel row below the header and new_bottom being the y coordinate of the pixel row where the footer starts.

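import os
import cv2

def crop_img(fic, output_folder, new_head, new_bottom):
    # new_head / new_bottom: pixel rows where the body starts and ends
    img = cv2.imread(fic)
    cropped = img[new_head:new_bottom, 0:img.shape[1]]
    # Assumption: save under the input file's own name in output_folder
    name = os.path.basename(fic)
    cv2.imwrite(os.path.join(output_folder, name), cropped)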

To extract text from the image:

Your Tesseract path is going to be something like: r'C:\Program Files\Tesseract-OCR\tesseract.exe'

import pytesseract
from PIL import Image

def imgtotext(img, tesseract_path):
    # Point pytesseract at the Tesseract executable
    pytesseract.pytesseract.tesseract_cmd = tesseract_path
    # Recognize the text in the image as a string
    text = pytesseract.image_to_string(Image.open(img))
    # Rejoin words that were hyphenated across line breaks
    text = text.replace('-\n', '')

    return text

Thank you for the effort but as I said, it is a 500+ pages pdf and converting them all to images is not a good idea (not to mention the accuracy with which pdftotext is able to achieve with text-extraction). — Raghav Gupta, May 13 '21 at 09:38
