pdfplumber extract_text function also extracts text from tables; how to extract only the text outside the tables

Question

I have a PDF that contains both text and tables. I want to extract both, but when I use the extract_text function it also returns the content inside the tables. I only want extract_text to return the text outside the tables, and to extract the tables themselves with the extract_tables function.

I have tested with a PDF that contains only tables, and extract_text still returns the table contents, which I want to extract with extract_tables instead.

score 1 · Answer 1 · answered Oct 08 '21 at 17:32

You can try the following code:

import pdfplumber

# Import the PDF.
pdf = pdfplumber.open("file.pdf")

# Load the first page.
p = pdf.pages[0]

# Table settings.
ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
}

# Get the bounding boxes of the tables on the page.
bboxes = [table.bbox for table in p.find_tables(table_settings=ts)]

def not_within_bboxes(obj):
    """Check if the object is in any of the table's bbox."""
    def obj_in_bbox(_bbox):
        """See https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404"""
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = _bbox
        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)
    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)

print("Text outside the tables:")
print(p.filter(not_within_bboxes).extract_text())

I am using the .filter() method provided by pdfplumber to drop any objects whose midpoint falls inside the bounding box of any of the tables. This creates a filtered version of the page, from which I then extract the text.
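To see what the predicate does without a PDF at hand, here is a minimal sketch of the same midpoint test applied to hand-written dictionaries. The sample objects and the bounding box are made up for illustration, and the helper names `midpoint_in_bbox` and `outside_all` are hypothetical. Real pdfplumber objects carry more keys, but the test only reads `x0`, `x1`, `top` and `bottom`.

```python
def midpoint_in_bbox(obj, bbox):
    """Return True if the centre of obj lies inside bbox = (x0, top, x1, bottom)."""
    v_mid = (obj["top"] + obj["bottom"]) / 2
    h_mid = (obj["x0"] + obj["x1"]) / 2
    x0, top, x1, bottom = bbox
    return x0 <= h_mid < x1 and top <= v_mid < bottom


def outside_all(obj, bboxes):
    """Return True if obj's centre lies outside every bbox in bboxes."""
    return not any(midpoint_in_bbox(obj, b) for b in bboxes)


# Hypothetical data: one table region and three character-like objects.
table_bboxes = [(50, 100, 300, 200)]
chars = [
    # Above the table: kept.
    {"text": "A", "x0": 60, "x1": 66, "top": 40, "bottom": 52},
    # Centre inside the table: dropped.
    {"text": "B", "x0": 60, "x1": 66, "top": 120, "bottom": 132},
    # Straddles the right edge, centre at x = 301: kept.
    {"text": "C", "x0": 296, "x1": 306, "top": 120, "bottom": 132},
]

kept = [c["text"] for c in chars if outside_all(c, table_bboxes)]
print(kept)
```

Because the test uses the midpoint rather than the full extent, an object that overlaps a table edge is assigned to whichever side its centre falls on.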

Since you haven't shared the PDF, the table settings I have used may not work for it, but you can change them to suit your needs.
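As one possible adjustment, if the tables in your PDF have no drawn ruling lines, the "lines" strategies may find nothing. A sketch of an alternative settings dictionary is shown here. The name `ts_text` is hypothetical and the tolerance value is only a starting point to tune against your own file.

```python
# Alternative table settings for tables without drawn borders.
# "text" infers cell edges from the alignment of words instead of
# from ruling lines on the page.
ts_text = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    # Assumed starting value; tune it for your PDF.
    "snap_tolerance": 3,
}
```

It would be passed in place of ts, for example p.find_tables(table_settings=ts_text).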
