I'm in the process of creating a PDF skimmer that reads a legal document, searches for keywords, returns the individual sentences that the keywords are a part of, then updates a checklist based on the conditions of the returned sentences.
All the PDFs I'm working with are legal documents, so they are standardized: no images, no special characters.
The current script takes multiple keywords and returns the page numbers that each keyword appears on. I now need the script to return the full sentence that contains the word. From searching around, it seems that I will have to use a delimiter to return the contents between two periods. The code below is what I currently have.
Any and all help is appreciated!
import fitz  # PyMuPDF
import re

search_words = ("x", "y", "z")
results = {}
docs = fitz.open("some_pdf.pdf")
for page in docs:
    # get_text("words") returns one tuple per word; index 4 holds the word text
    words = [w[4].lower() for w in page.get_text("words")]
    for sword in search_words:  ### loops through search list ###
        for word in words:
            if sword in word:  ### a search word is part of a word on the page
                pages = results.get(sword, set())  ### gets the set of page numbers so far
                pages.add(page.number)  ### adds a page number (page.number is 0-based)
                results[sword] = pages  ### writes back to results
### REPORT OF RESULTS ###
for word in results:
    page_nums = list(map(str, sorted(results[word])))  ### page numbers as strings, in order
    page_list = ", ".join(page_nums)  ### separates page numbers with commas
    print("word '%s' occurs on pages %s" % (word, page_list))
###OUTPUT###
word 'x' occurs on pages n, n+1, n+2
word 'y' occurs on pages n, n+4, n+6
word 'z' occurs on pages n+25, n+20
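
For the sentence step, here is a rough sketch of the delimiter idea described above, not a finished solution. It works on plain text only, so it can be tried without a PDF. The names `find_sentences` and `SENTENCE_END` are made up for illustration and are not part of PyMuPDF. It assumes a sentence ends at `.`, `!` or `?` followed by whitespace, which will split incorrectly on abbreviations such as "No. 5" that are common in legal text.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
# Abbreviations ("No. 5", "U.S. Code") will be split in the wrong place.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def find_sentences(text, search_words):
    """Map each search word to the sentences in `text` that contain it."""
    # Collapse line breaks so a sentence wrapped across lines stays together.
    flat = " ".join(text.split())
    sentences = [s for s in SENTENCE_END.split(flat) if s]
    hits = {}
    for sword in search_words:
        needle = sword.lower()
        # Substring match, same as the page-number search in the script above.
        matches = [s for s in sentences if needle in s.lower()]
        if matches:
            hits[sword] = matches
    return hits


sample = (
    "The tenant shall pay rent monthly. The landlord may\n"
    "terminate the lease. Notice is required."
)
print(find_sentences(sample, ("rent", "lease")))
# {'rent': ['The tenant shall pay rent monthly.'], 'lease': ['The landlord may terminate the lease.']}
```

With PyMuPDF, the text for each page could come from `page.get_text()`, so calling `find_sentences(page.get_text(), search_words)` inside the page loop would give sentences per page. A sentence that runs across a page break would be cut in two with this approach.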