What is the best way to extract the body of an article with Python?

Question

Summary

I am building a text summarizer in Python. The documents I am mainly targeting are scholarly papers, which are usually in PDF format.

What I Want to Achieve

I want to reliably extract the body of the paper (abstract through conclusion), excluding the paper's title, publisher names, images, equations, and references.

Issues

I have looked for effective ways to do this, but I was not able to find anything tangible and useful. My current code splits the PDF text into sentences and then filters out the sentences that are shorter than the average sentence length in characters. Below is the code:

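from pdfminer import high_level

# input: string (path to the PDF file)
# output: list of sentences that are at least as long as the average sentence
def pdf2sentences(pdf):
    article_text = high_level.extract_text(pdf)
    # Splitting on '.' only roughly splits on every sentence.
    # Strip whitespace and drop empty pieces so they do not skew the average.
    sents = [s.strip() for s in article_text.split('.') if s.strip()]
    if not sents:
        return []

    # Average sentence length in characters
    ave_len = sum(len(s) for s in sents) / len(sents)

    # Keep only the sentences that are at least as long as the average
    return [s for s in sents if len(s) >= ave_len]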

Note: I am using this article as input.

The code above seems to work fine, but I am still not able to effectively filter out things like the title and publisher names that come before the abstract, or the references section that comes after the conclusion. Moreover, things like images cause gibberish characters to show up in the text, which hurts the overall quality of the output. Because of the odd Unicode characters, I am also not able to write the output to a txt file.
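
For the gibberish characters and the failed write to a txt file, something along these lines might help. This is only a sketch under a few assumptions: that the write error comes from the platform's default file encoding rather than from the text itself, that the junk is mostly control characters and "(cid:NN)" placeholders (which pdfminer can emit for glyphs it cannot map), and that dropping those characters outright is acceptable. The names clean_extracted_text and write_text are hypothetical helpers, not part of pdfminer.

```python
import re
import unicodedata

def clean_extracted_text(text):
    """Remove characters that usually show up as gibberish in extracted PDF text."""
    # NFKC folds compatibility characters, e.g. the "ffi" ligature, into plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Drop "(cid:NN)" placeholders for glyphs that could not be mapped to Unicode.
    text = re.sub(r"\(cid:\d+\)", "", text)
    kept = []
    for ch in text:
        # Keep newlines and tabs; drop every other character whose Unicode
        # category starts with "C" (control, format, surrogate, private use,
        # unassigned). This also removes form feeds between pages.
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C"):
            kept.append(ch)
    return "".join(kept)

def write_text(path, text):
    """Write text with an explicit UTF-8 encoding instead of the platform default."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
```

Usage would look like write_text("out.txt", clean_extracted_text(article_text)). Passing encoding="utf-8" explicitly is the part most likely to fix the write error on systems whose default encoding is not UTF-8.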

Appeal

Are there ways I can improve the performance of this parser and make it more consistent?

Thank you for your answers!

The output you are expecting cannot be produced with a few lines of code. The things you want to exclude, like the title, abstract, images, and equations, need to be checked manually before doing OCR. You will need to use an optical character recognition application for this. - Srinath Neela, Aug 18 '20 at 17:21
@SrinathNeela, are there any other ways to do this sort of thing and get "pretty good" accuracy? I am not looking for something perfect at this point. - mdave1701, Aug 19 '20 at 17:41
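
One heuristic that may get "pretty good" results without OCR is to slice the extracted text between the abstract heading and the first back-matter heading. The sketch below assumes the paper has a line starting with "Abstract" and a line consisting only of a heading such as "References", "Bibliography", or "Acknowledgments" (optionally numbered). Many papers will not follow this layout exactly, so the patterns are guesses to adjust, and extract_body is a hypothetical helper name. It does not remove equations or image residue.

```python
import re

# Assumed heading patterns; real papers vary, so treat these as a starting point.
START_HEADING = re.compile(r"^[ \t]*abstract\b", re.IGNORECASE | re.MULTILINE)
END_HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]*)?(?:references|bibliography|acknowledge?ments?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def extract_body(text):
    """Return the text from the abstract heading up to the first back-matter heading."""
    start = START_HEADING.search(text)
    # If no abstract heading is found, fall back to the start of the text.
    begin = start.start() if start else 0
    # Only look for the end heading after the abstract.
    end_match = END_HEADING.search(text, begin)
    # If no end heading is found, keep everything to the end of the text.
    end = end_match.start() if end_match else len(text)
    return text[begin:end].strip()
```

The result can then be passed to the sentence splitter in place of the full text, for example by calling extract_body on the string returned by high_level.extract_text before splitting it.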
