
I'm in the process of creating a PDF skimmer that reads a legal document, searches for keywords, returns the individual sentences that the keywords are a part of, then updates a checklist based on the conditions of the returned sentences.

All the PDFs I'm working with are legal documents, so they are standardized: no images, no special characters.

The current script allows the user to enter multiple keywords and returns the page number that each keyword is on. I now need the script to return the full sentence that contains the word. From searching around, it seems I will have to use a delimiter to return the contents between two periods. The code below is what I currently have.

Any and all help is appreciated!

import fitz
import re

search_words = ("x", "y", "z")
results = {}
doc = fitz.open("some_pdf.pdf")

for page in doc:
    words = [w[4].lower() for w in page.get_text("words")]
    for sword in search_words:  # loop through the search list
        for word in words:
            if sword in word:  # a search word is part of a word on the page
                pages = results.get(sword, set())  # set of page numbers so far
                pages.add(page.number)  # add this page number
                results[sword] = pages  # write back to results

###REPORT OF RESULTS###
for word in results:
    pages = list(map(str, sorted(results[word])))  # page numbers as strings
    page_list = ", ".join(pages)  # comma-separate the page numbers
    print("word '%s' occurs on pages %s." % (word, page_list))

###OUTPUT### 

word 'x' occurs on pages n, n+1, n+2 
word 'y' occurs on pages n, n+4, n+6  
word 'z' occurs on pages n+25, n+20
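The delimiter idea described above can be sketched with `re.split` on a period followed by whitespace. The sample text here is hypothetical; in the real script the text would come from `page.get_text()`:

```python
import re

# hypothetical page text standing in for page.get_text() output
page_text = "The tenant shall pay rent. Rent is due monthly. Late fees may apply."

# split after each period followed by whitespace (a simple delimiter
# approach; abbreviations such as "No. 5" would need extra handling)
sentences = re.split(r"(?<=\.)\s+", page_text)

keyword = "rent"
matches = [s for s in sentences if keyword in s.lower()]
print(matches)  # ['The tenant shall pay rent.', 'Rent is due monthly.']
```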




1 Answer


Hi there! It seems I had a similar task some time ago.

import textract

def pdf_cleaner(pdf):
    # split raw extracted bytes into paragraphs, strip punctuation and
    # layout characters, then split each paragraph into sentences on "."
    paragraphs = pdf.decode("utf-8").split("\n\n")
    parsed_pdf = list()
    for paragraph in paragraphs:
        tokens = (paragraph.replace("\n", " ")
                  .replace(":", "").replace(";", "").replace(",", "")
                  .replace("►", "").replace("\x0c", "")
                  .lower().split("."))
        parsed_pdf.extend(tokens)
    return parsed_pdf

def get_example(text, word):
    # return the first sentence containing the word, or None if not found
    for sentence in text:
        if word in sentence.split(" "):
            return sentence
    return None



text = textract.process('9BMN0W-chowdhuryIR1.pdf', method='pdfminer')
pdf = pdf_cleaner(text)
words = ["some", "words", "to", "find"] 
examples = [get_example(pdf, word) for word in words]
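To check the two helpers without a real PDF, you can feed them raw bytes directly. The sample bytes below are made up, and the cleaner is a simplified copy of the one above; in practice the bytes come from `textract.process`:

```python
def pdf_cleaner(pdf):
    # simplified copy of the cleaner above, standalone for the demo
    parsed_pdf = []
    for paragraph in pdf.decode("utf-8").split("\n\n"):
        parsed_pdf.extend(paragraph.replace("\n", " ").lower().split("."))
    return parsed_pdf

def get_example(text, word):
    # first sentence containing the word, or None
    for sentence in text:
        if word in sentence.split(" "):
            return sentence
    return None

sample = b"The tenant shall pay rent monthly.\nLate fees may apply.\n\nNotice must be given."
cleaned = pdf_cleaner(sample)
print(get_example(cleaned, "rent"))     # 'the tenant shall pay rent monthly'
print(get_example(cleaned, "missing"))  # None
```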
Fedor
  • Thank you for the reply. This does look promising. I am getting the following error: "expected str, byte or os.PathLike object, NoneType." Do you think this is related to the pathing of the file? – Ravi Thehidden Aug 23 '23 at 20:14
  • I don't think so, but double-check that the file exists. Please provide more information about the error. Maybe you can run `text = textract.process('9BMN0W-chowdhuryIR1.pdf', method='pdfminer')`, `pdf = pdf_cleaner(text)`, `words = ["some", "words", "to", "find"]`, and `examples = [get_example(pdf, word) for word in words]` as separate lines to find the bug – Fedor Aug 24 '23 at 08:03