I have this piece of code:
import fitz  # PyMuPDF
from tkinter import filedialog  # assumed source of filedialog

kwfile = fitz.open(filedialog.askopenfilename())  # the keywords PDF
# extract kwfile content as plain text across all pages:
text = " ".join([page.get_text() for page in kwfile])
keywords = text.replace("\n", " ").split()  # make keywords list
keywords = list(set(keywords))  # remove duplicates
doc = fitz.open(filedialog.askopenfilename())  # open PDF with pymupdf
for page in doc:  # loop through the pages of the PDF
    words = page.get_text("words")  # extract page text by single words
    for word in words:
        if word[4] in keywords:  # item 4 contains actual word text string
            page.add_highlight_annot(word[:4])  # highlight the word
doc.save("markedwords.pdf")
This code needs two PDF files: a keyword PDF and the original PDF. When run, it searches the original PDF for every word that appears in the keyword PDF. At the end it saves a copy of the original PDF (markedwords.pdf) in which every word it found is highlighted in yellow.
Now I need help with one thing: is it possible to exclude certain words so that they are never marked? Common words like "the", "for", "and", and "but" sometimes get marked, and I do not want these words to be marked.
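To make the goal concrete, here is a rough sketch of the kind of filtering I have in mind. It is only an illustration under assumptions: `EXCLUDED`, `filter_keywords`, and `words_to_highlight` are hypothetical names I made up, and the sample tuples stand in for what `page.get_text("words")` returns, so the snippet runs without any PDF.

```python
# Hypothetical exclusion list; extend it with any words that should never be marked.
EXCLUDED = {"the", "for", "and", "but"}


def filter_keywords(keywords, excluded=EXCLUDED):
    """Return the keywords as a set, minus excluded words (compared case-insensitively)."""
    return {kw for kw in keywords if kw.lower() not in excluded}


def words_to_highlight(words, keywords):
    """Yield the rectangle (first four items) of each word tuple whose text is a keyword.

    `words` is expected in the shape returned by page.get_text("words"):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    for word in words:
        if word[4] in keywords:
            yield word[:4]


# Stand-in data so the sketch runs without PyMuPDF or a real PDF.
keywords = filter_keywords(["The", "invoice", "and", "total"])
sample_words = [
    (10, 10, 30, 20, "the", 0, 0, 0),
    (35, 10, 80, 20, "invoice", 0, 0, 1),
    (85, 10, 120, 20, "total", 0, 0, 2),
]
rects = list(words_to_highlight(sample_words, keywords))
```

In the real script, I imagine the keyword list would be passed through `filter_keywords(...)` once before the page loop, and each rectangle would then be handed to `page.add_highlight_annot(...)`. Using a set here also keeps the `in` check fast for large keyword files.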