1

I want pdfplumber to extract the text from an arbitrary PDF supplied by the user. The problem is that pdfplumber also extracts the header text (the title) from each page. How can I make pdfplumber skip the page headers (titles) and the page numbers (or the whole footer, if possible)?

Here is my code:

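import pdfplumber

all_text = ""

# The context manager closes the PDF once extraction is done
with pdfplumber.open(file) as pdf:
    for pdf_page in pdf.pages:
        one = pdf_page.extract_text()
        all_text = all_text + '\n' + str(one)

# Print once, after all pages have been collected
print(all_text)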

where file is the PDF document.

Anandakrishnan

2 Answers

4

I don't think you can.

However, you can crop each page with the crop method and extract text only from the cropped region, which leaves out the headers and footers. Of course, this approach requires that you know the heights of the headers and footers in advance.

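Here is the explanation of the coordinates; the % marks each value as a fraction (0 to 1) of the page width or height, which the code then multiplies by the page dimensions: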

x0 = % Distance of left side of character from left side of page.
top = % Distance of top of character from top of page.
x1 = % Distance of right side of character from left side of page.
bottom = % Distance of bottom of the character from top of page.

Here is the code:

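import pdfplumber

# Get text of whole document as string
# Fractions (0 to 1) of page width/height: [x0, top, x1, bottom].
# These values are only an example; tune them for your PDF.
crop_coords = [0.0, 0.1, 1.0, 0.9]
text = ''
pages = []
with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        my_width = float(page.width)
        my_height = float(page.height)
        # Scale the fractional coordinates to this page's dimensions
        my_bbox = (
            crop_coords[0] * my_width,
            crop_coords[1] * my_height,
            crop_coords[2] * my_width,
            crop_coords[3] * my_height,
        )
        # Crop pages
        page_crop = page.crop(bbox=my_bbox)
        # extract_text() may return None for a page without text
        text = text + (page_crop.extract_text() or '').lower()
        pages.append(page_crop)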
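
If the header and footer heights are known in points rather than as fractions of the page, the same crop can be built from fixed margins. This is only a sketch: `body_bbox`, `extract_body_text` and the default 50-point margins are hypothetical names and values, not part of pdfplumber, and it assumes each page's bounding box starts at (0, 0).

```python
def body_bbox(page_width, page_height, header_height=50.0, footer_height=50.0):
    """Return (x0, top, x1, bottom) for the page minus its header and footer bands."""
    top = float(header_height)
    bottom = float(page_height) - float(footer_height)
    if top >= bottom:
        raise ValueError("header and footer margins leave no page body")
    return (0.0, top, float(page_width), bottom)


def extract_body_text(path, header_height=50.0, footer_height=50.0):
    """Extract text from every page of `path`, skipping the header and footer bands."""
    # Imported here so body_bbox can be used without pdfplumber installed
    import pdfplumber

    parts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            bbox = body_bbox(page.width, page.height, header_height, footer_height)
            # extract_text() may return None for a page without text
            parts.append(page.crop(bbox).extract_text() or "")
    return "\n".join(parts)
```

Calling `extract_body_text("report.pdf", header_height=60, footer_height=40)` would then return the body text of a (hypothetical) `report.pdf`; the margins still have to be measured for each document layout.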
SilentCloud
  • The crop arguments should be updated; see https://github.com/jsvine/pdfplumber: `x0` Distance of left side of character from left side of page. `x1` Distance of right side of character from left side of page. `top` Distance of top of character from top of page. `bottom` Distance of bottom of the character from top of page. – fitz Sep 24 '21 at 13:17
  • Done. I will leave the `%` on purpose because it looks clearer to me – SilentCloud Sep 24 '21 at 15:22
0

If your documents share similar header and footer text, you can try matching it with regular expressions. In my documents, a date and a page number appeared on every page, so I used the following code.

    import re
    import pdfplumber

    text = ''
    with pdfplumber.open(pdf_upload) as pdf:
        for page in pdf.pages:
            text += page.extract_text() or ''
    footer_pattern = r'(page|Page|PAGE)\s*\d+\s*(of|OF|Of)\s*\d+'
    header_pattern = r'(January|February|March|April|May|June|July|August|September|October|November|December|january|february|march|april|may|june|july|august|september|october|november|december|JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)\s*\d{2}\s*(\,|\s)\s*\d{4}'

The footer pattern matches text like "Page 1 of 20", and the header pattern matches dates like "January 20, 2021". Such matches are removed by replacing them with an empty string:

    replace = ''
    texxt = re.sub(footer_pattern, replace, text)
    texxt = re.sub(header_pattern, replace, texxt)
    # Collapse each pair of whitespace characters into a single space
    texxt = re.sub(r'\s{2}', ' ', texxt)

This will at least remove the header/footer patterns from the extracted text.
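
The same idea can be packaged as a small function that works on already extracted text. This is a rough sketch rather than a general solution: `strip_headers_footers` is a hypothetical helper, the patterns assume footers like "Page 1 of 20" and headers like "January 20, 2021", and, unlike the patterns above, it matches case-insensitively and accepts one-digit days.

```python
import re

# Assumed footer format: "Page 1 of 20" in any letter case
FOOTER_RE = re.compile(r"\bpage\s*\d+\s*of\s*\d+\b", re.IGNORECASE)

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Assumed header format: "January 20, 2021" in any letter case
HEADER_RE = re.compile(r"\b(?:" + MONTHS + r")\s*\d{1,2}\s*,?\s*\d{4}\b", re.IGNORECASE)


def strip_headers_footers(text):
    """Remove date headers and 'Page X of Y' footers from extracted text."""
    text = FOOTER_RE.sub("", text)
    text = HEADER_RE.sub("", text)
    # Collapse runs of spaces or tabs and trim each line
    lines = [re.sub(r"[ \t]{2,}", " ", line).strip() for line in text.splitlines()]
    # Drop lines that are now empty (this also drops original blank lines)
    return "\n".join(line for line in lines if line)


sample = (
    "January 20, 2021\n"
    "First line of body.\n"
    "Page 1 of 20\n"
    "JANUARY 20, 2021\n"
    "Second  line.\n"
    "PAGE 2 OF 20"
)
print(strip_headers_footers(sample))
```

Note that a date or "page N of M" phrase appearing in the body text would be removed as well, so this only suits documents where those patterns occur solely in headers and footers.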