Remove the garbage words from the pdf

Question

I am extracting text from a PDF using Python and libraries such as fitz and pdfreader. But my PDF contains some schematics, and the words in them are text I do not need.

Here is an example.

When I extract the text, the words from the schematics are included as well, but I do not want those words to appear. Even if the image itself can be extracted, the text inside it is not meaningful.

I could not come up with a strategy to delete these useless words from the pdf.

import fitz  # PyMuPDF

class DeleteGarbage(object):
    def __init__(self, max_table_area=1.5):
        self.max_table_area = max_table_area

    def process(self, context):
        '''Redact text on or near vector lines and rectangles that lie outside known tables.'''
        white = (1, 1, 1)
        pad = 10  # margin in points added around each drawing
        for page_number, page in enumerate(context["fitz"]):
            if page_number != 2:  # only the page at index 2 is processed
                continue
            area_of_page = page.rect.width * page.rect.height
            tables = context['content']['pages'][page_number - 1]['tables']
            paths = page.get_drawings()  # extract existing drawings

            for path in paths:
                for item in path["items"]:
                    if item[0] == "l":  # line from item[1] to item[2]
                        # order the coordinates so the padding always grows the box,
                        # even for lines drawn right-to-left or bottom-to-top
                        rect = [min(item[1][0], item[2][0]), min(item[1][1], item[2][1]),
                                max(item[1][0], item[2][0]), max(item[1][1], item[2][1])]
                    elif item[0] == "re":  # rectangle
                        rect = item[1]
                        if rect.get_area() >= area_of_page / self.max_table_area:
                            continue  # skip very large rectangles
                    else:
                        continue  # other item types are left untouched
                    if self.check_if_not_table(rect, page_number, tables):
                        page.add_redact_annot(
                            [rect[0] - pad, rect[1] - pad, rect[2] + pad, rect[3] + pad],
                            "",
                            align=fitz.TEXT_ALIGN_CENTER,
                            fill=white,
                            text_color=white,
                        )

            page.apply_redactions()
        return context

    def check_if_not_table(self, rect, page_number, tables):
        '''Return False if rect lies inside any known table box (with a 10 point tolerance).'''
        for table_coordination in tables['coordination']:
            if table_coordination[0] - 10 < rect[0] and table_coordination[1] - 10 < rect[1] and table_coordination[2] + 10 > rect[2] and table_coordination[3] + 10 > rect[3]:
                return False
        return True

What you call "schematics" are probably vector graphics, and PyMuPDF can extract them. If you absolutely don't want the text appearing inside vector graphics areas, you can extract the graphics areas and omit text appearing there. Or use PDF redactions to remove them. — Jorj McKie, Aug 30 '23 at 10:37
The issue is that the text can be above the line, below the line, or inside the square. There are too many possibilities. — Muhammad Samadzade, Aug 30 '23 at 11:00
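
The suggestion in the comment above (extract the graphics areas and omit text appearing there) could be sketched roughly as follows. This is only an illustration, not the commenter's code: the helper names and the 10 point padding are assumptions. It relies on PyMuPDF's page.get_text("words"), which returns tuples starting with (x0, y0, x1, y1, word), and on page.get_drawings(), where each drawing carries a "rect" entry. Padding each graphics rectangle is one way to catch text sitting just above or below a line.

```python
def rects_intersect(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def drop_words_near_graphics(words, graphic_rects, pad=10):
    """Keep only words whose bbox does not intersect any padded graphics rect.

    words: tuples starting with (x0, y0, x1, y1, text, ...), the layout
    that page.get_text("words") returns.
    """
    padded = [(x0 - pad, y0 - pad, x1 + pad, y1 + pad)
              for x0, y0, x1, y1 in graphic_rects]
    return [w for w in words
            if not any(rects_intersect(w[:4], r) for r in padded)]


def text_outside_graphics(page, pad=10):
    """Hypothetical helper: page is assumed to be a PyMuPDF page object."""
    graphic_rects = [tuple(d["rect"]) for d in page.get_drawings()]
    kept = drop_words_near_graphics(page.get_text("words"), graphic_rects, pad)
    return " ".join(w[4] for w in kept)  # note: line breaks are not preserved
```

Table borders are vector graphics too, so a filter like this would also drop table text unless table areas are excluded first, as check_if_not_table does in the question.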

K J · Answer 1 · 2023-08-31T00:30:58.640

Your strategy is reasonable, but the problem with many documents like this is that the contents are often stored out of order. Here we can see that the extracted heading area is actually the last content written in the body text.

One way would be to draw redaction areas to remove the unwanted upper section of searchable graphics, but that is often more work than selecting the desired section, so let's concentrate on the tabular layout. It could just as easily be two columns, etc.

What we need is a profile for the page extraction. In this case, for page 3 we want the area as defined here.

So we can build a list of desired areas per page and then run them all as one script to output everything in the right order.

For an example of 2 columns per page see https://stackoverflow.com/a/77008749/10802527 where with a few adjustments that page profile could be used on page 1 (shown below) using

for left -x 0 -y 110 -W 300 -H 700
& right -x 300 -y 110 -W 300 -H 400
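
Since the question uses fitz, a rough Python counterpart of these crop values might look like the sketch below. It reads -x -y -W -H as left, top, width and height of a crop box, which is an assumption about the tool used in the linked answer, and the function names are made up for illustration. The only PyMuPDF call is page.get_text with its clip argument.

```python
def crop_to_rect(x, y, width, height):
    """Convert a crop box given as left, top, width, height to (x0, y0, x1, y1)."""
    return (x, y, x + width, y + height)


def extract_region(page, x, y, width, height):
    """Return only the text inside the crop box; page is assumed to be a PyMuPDF page."""
    return page.get_text("text", clip=crop_to_rect(x, y, width, height))


def extract_two_columns(page, left, right):
    """Read the left column first, then the right one, so the text stays in reading order."""
    return extract_region(page, *left) + extract_region(page, *right)


# Usage with the values above (doc is assumed to be an open fitz document):
# text = extract_two_columns(doc[0], (0, 110, 300, 700), (300, 110, 300, 400))
```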

Since it's smaller, only the right half is seen here on the console, but you will be redirecting the output to a file such as output.txt.

If you take a batch of desired areas and write a modular command, you could simply write the following (consider adding ranges of similar pages rather than single pages):

pdfEXfunc file.pdf 2col 1 110 700 400 // for split page 1
pdfEXfunc file.pdf 2col 2 100 200 200 // for page 2 TOC
pdfEXfunc file.pdf 1col 2 300 200     // for page 2 REVisions
pdfEXfunc file.pdf 1col 3 270 250     // for full width page 3
pdfEXfunc file.pdf etc etc.
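
The same per-page profile idea could be sketched in Python with PyMuPDF, for readers who prefer to stay in one script. Everything here is hypothetical: the PROFILE layout and the function name are invented for illustration, and only the page 1 boxes reuse the crop values given earlier. The pdfEXfunc arguments above are not translated.

```python
# Hypothetical profile: 1-based page number -> crop boxes as
# (left, top, width, height), listed in the desired reading order.
PROFILE = {
    1: [(0, 110, 300, 700), (300, 110, 300, 400)],  # two columns on page 1
}


def extract_by_profile(doc, profile):
    """Collect the text of every profiled region, page by page.

    doc is assumed to be an open PyMuPDF document (indexable by 0-based page).
    """
    chunks = []
    for page_no in sorted(profile):
        page = doc[page_no - 1]
        for x, y, w, h in profile[page_no]:
            chunks.append(page.get_text("text", clip=(x, y, x + w, y + h)))
    return "\n".join(chunks)
```

Pages that are missing from the profile are skipped entirely, so a range of similar pages needs one entry per page or a small loop that fills the dictionary.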

Thanks for the detailed answer. But what if I have 200 PDFs and cannot know the split coordinates in each of them? In that case, what do you suggest I do? — Muhammad Samadzade, Aug 31 '23 at 08:02
