Summary
I am building a text summarizer in Python. The kind of documents that I am mainly targeting are scholarly papers that are usually in pdf format.
What I Want to Achieve
I want to effectively extract the body of the paper (abstract to conclusion), excluding title of the paper, publisher names, images, equations and references.
Issues
I have tried looking for effective ways to do this, but I was not able to find something tangible and useful. The current code I have tries to split the pdf document by sentences and then filters out the entries that have less than average number of characters per sentence. Below is the code:
from pdfminer import high_level
# input: string (path to the file)
# output: list of sentences
def pdf2sentences(pdf):
article_text = high_level.extract_text(pdf)
sents = article_text.split('.') #splitting on '.', roughly splits on every sentence
run_ave = 0
for s in sents:
run_ave += len(s)
run_ave /= len(sents)
sents_strip = []
for sent in sents:
if len(sent.strip()) >= run_ave:
sents_strip.append(sent)
return sents_strip
Note: I am using this article as input.
Above code seems to work fine, but I am still not effectively able to filter out thing like title and publisher names that come before the abstract section and things like the references section that come after the conclusion. Moreover, things like images are causing gibberish characters to show up in the text which is messing up the overall quality of the output. Due to the weird unicode characters I am not able to write the output to a txt file.
Appeal
Are there ways I can improve the performance of this parser and make it more consistent?
Thank you for your answers!