Find and mark words in a PDF EXCEPT some words python

Question

I got this part of code:


import fitz  # PyMuPDF
from tkinter import filedialog

kwfile = fitz.open(filedialog.askopenfilename())  # the keywords PDF

# the following extracts kwfile content as plain text across all pages:
text = " ".join([page.get_text() for page in kwfile])
keywords = text.replace("\n", " ").split()  # make keywords list

keywords = list(set(keywords))  # remove duplicates
doc = fitz.open(filedialog.askopenfilename())  # open PDF with pymupdf
for page in doc:  # loop through the pages of the PDF
    words = page.get_text("words")  # extract page text by single words
    for word in words:
        if word[4] in keywords:  # item 4 contains actual word text string
            page.add_highlight_annot(word[:4])  # highlight the word

doc.save("markedwords.pdf")

This code needs two PDF files: a keyword PDF and the original PDF. When you run it, it searches the original PDF for every word found in the keyword PDF. At the end it saves a copy of the original PDF in which all matching words are highlighted in yellow.

Now I need help with something: is it possible to exclude words, i.e. words which must not be marked? Sometimes words like "the", "for", "and", "but" get marked, and I do not want these words to be marked.

Without knowing the content of the keyword PDF file it is basically impossible to help you. Which words are you trying to mark yellow? — felix, Feb 09 '23 at 09:04
The keywords file could be any content. This program marks every word from the keyword file in the original file if it finds some. The user can decide which keywords he wants to search for. — Furk276, Feb 09 '23 at 09:29
I understand that, I mean your specific case. If you have "the", "for" and so on in the file they would obviously be marked but I assume they aren't in there. The code you provided doesn't show problems like marking three letter words or words not in the file, so for your specific case we would need your specific keyword list to help. — felix, Feb 09 '23 at 10:05
Yes, but there is this problem that the code marks words which contain other words. As an example of my file I can show you these: Tools or directory -> marked word is "to", because these words contain the word "to" (TOols, direcTOry) — Furk276, Feb 09 '23 at 10:11
Looks like you simply have to go over the keywords list and remove stuff you do not want. Use e.g. Python's `filter` function: `keywords = list(filter(function, set(keywords)))`. The `function` will be called with every item of the set and must return True to keep the item or False to drop it. Put some knowledge in `function` like "no 3-letter words", no "else", no ... whatever. — Jorj McKie, Feb 09 '23 at 14:34

score 0 · Answer 1 · answered Feb 09 '23 at 09:19

Disclaimer: I am the author of borb, the library used in this answer.

I would split the problem into 3 parts:

get the text from a PDF
decide which words you'd like to mark
mark those words in the PDF

Step 1: Get the text from a PDF

import typing
from pathlib import Path

from borb.pdf import Document, PDF
from borb.toolkit import SimpleTextExtraction

def get_text_from_pdf(p: Path) -> str:
    """
    This function returns the complete text from a PDF,
    where the text on separate pages is separated by a newline character
    """
    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open(p, "rb") as pdf_file_handle:
        # the listener `l` collects the text of each page while the PDF is parsed
        doc = PDF.loads(pdf_file_handle, [l])
    if doc is None:
        return ""
    number_of_pages: int = int(doc.get_document_info().get_number_of_pages() or 0)
    # get_text() is indexed by page number (starting at 0)
    return "".join([l.get_text()[i] + "\n" for i in range(0, number_of_pages)])
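Once the text is available as one string, it still has to be cut into individual words before any filtering can happen. The following is only a rough sketch using the standard library; the helper name `split_into_words` and the choice to lower-case everything and keep hyphenated words together are assumptions of mine, not something borb or the question prescribes.

```python
import re

# Letters only (Unicode-aware), optionally joined by an apostrophe or hyphen,
# so that "well-known" and "don't" stay in one piece.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def split_into_words(text: str) -> list[str]:
    """Return the distinct words of `text`, lower-cased, in first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order
    seen = dict.fromkeys(match.lower() for match in WORD_PATTERN.findall(text))
    return list(seen)
```

The result could then be passed on to whatever keyword selection you settle on in step 2.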

Step 2: Decide which words you would like to mark

This is up to you. There are various algorithms to decide which words are keywords in a document. Some of them are even implemented in borb already.

You can find those here.

There are also GitHub repositories with plaintext files containing taboo/stopwords. You can include such a list in your code to avoid marking words like "for" and "the".
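As a minimal sketch of that idea: the stop-word set below is a tiny hypothetical example (a real list would be loaded from one of those plaintext files), and the names `STOPWORDS`, `normalize` and `select_keywords`, as well as the minimum length of 3, are assumptions chosen for illustration.

```python
import string

# Hypothetical, deliberately short stop-word list for illustration only.
STOPWORDS = {"the", "for", "and", "but", "to"}


def normalize(word: str) -> str:
    """Lower-case a word and strip punctuation from both ends."""
    return word.strip(string.punctuation).lower()


def select_keywords(raw_words, stopwords=STOPWORDS, min_length=3):
    """Return the set of normalized words that should be marked.

    A word is dropped if it is shorter than `min_length` characters
    or if it appears in `stopwords`.
    """
    keywords = set()
    for raw in raw_words:
        word = normalize(raw)
        if len(word) >= min_length and word not in stopwords:
            keywords.add(word)
    return keywords
```

Returning a set also makes the later membership test (`word in keywords`) cheap, which matters when every word of a long PDF is checked against it.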

Step 3: Mark those words in the PDF

Marking words (or any content really) in a PDF can be done using so-called "annotations". You can think of an annotation as "anything you would add to an existing document after its creation".

Annotations can be:

geometric shapes
text
sound
video
links inside/outside the document itself
and more

You could also (but this is significantly harder) modify the page itself, such that rather than drawing text at a given location, you add instructions to first draw a highlighter-colored box underneath the text.

If you want more information about adding annotations to a PDF, you can find it here.
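For completeness, here is a sketch of the marking step that stays with the two PyMuPDF calls already used in the question (`page.get_text("words")` and `page.add_highlight_annot(...)`) rather than borb. The function names are hypothetical, and the sketch assumes that `keywords` is a set of lower-case words from which the unwanted ones have already been removed.

```python
import string


def rects_to_highlight(words, keywords):
    """Yield the bounding box (x0, y0, x1, y1) of every word that is a keyword.

    `words` is a sequence of tuples shaped like the items returned by
    page.get_text("words"): the first four items are the coordinates
    and item 4 is the word text.
    """
    for word in words:
        # compare whole words only, ignoring case and surrounding punctuation
        text = word[4].strip(string.punctuation).lower()
        if text in keywords:
            yield tuple(word[:4])


def highlight_page(page, keywords):
    """Highlight all keyword occurrences on one page and return their count."""
    count = 0
    for rect in rects_to_highlight(page.get_text("words"), keywords):
        page.add_highlight_annot(rect)
        count += 1
    return count
```

With a document opened as in the question, this would be called as `highlight_page(page, keywords)` inside the `for page in doc:` loop, followed by `doc.save(...)`. Because whole words are compared, a keyword such as "to" does not match inside "Tools" or "directory".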
