Grabbing an article from a pdf file - Python

Question

I have more than 5000 PDF files, each between 15 and 20 pages long. I used PyPDF2 to find which of these files contain the keyword I am looking for, and on which page.

Now I have the following data:

Is there a way to extract the specific article from the specific page using this data? I now know which files to check and which page in each.

Thanks a lot.

score 1 · Accepted Answer · answered Mar 16 '21 at 23:36

There is a library called tika that can extract the text of a PDF. If you first split your PDF so that only the page in question remains, you can then use:

from tika import parser
# 'sample.pdf' should contain only the page you are interested in
parsed_page = parser.from_file('sample.pdf')
print(parsed_page['content'])

NOTE: This library requires Java to be installed on the system, because it runs Apache Tika (a Java application) in the background.
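
The splitting step is not shown above, so here is a minimal sketch of one way it might be done. It assumes PyPDF2 1.28 or newer (the `PdfReader`/`PdfWriter` names), that page numbers are 0-based indexes as PyPDF2 reports them, and that tika and Java are installed. The helper names and the `_pageN.pdf` naming scheme are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def single_page_path(src: str, page_index: int) -> Path:
    """Hypothetical naming scheme: report.pdf, index 3 -> report_page3.pdf."""
    p = Path(src)
    return p.with_name(f"{p.stem}_page{page_index}.pdf")


def split_out_page(src: str, page_index: int) -> Path:
    """Write one page of src (0-based index) to its own PDF file."""
    # Imported lazily so this module still loads where PyPDF2 is missing.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    writer.add_page(reader.pages[page_index])
    dst = single_page_path(src, page_index)
    with open(dst, "wb") as fh:
        writer.write(fh)
    return dst


def page_text(src: str, page_index: int) -> str:
    """Return the text tika extracts from a single page of src."""
    # Imported lazily: tika needs Java and starts a local Tika server.
    from tika import parser

    parsed = parser.from_file(str(split_out_page(src, page_index)))
    # 'content' can be None when tika finds no text (e.g. scanned pages).
    return parsed.get("content") or ""
```

With the filenames and page numbers you already collected, you could then call something like `page_text('sample.pdf', 4)` for each hit. Note that this returns the text of the whole page; isolating a single article within that page would still need extra logic, such as searching around the keyword.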
