
I have more than 5,000 PDF files, each between 15 and 20 pages long. I used PyPDF2 to find which of these files contain the keyword I am looking for, and on which page.

Now I have the following data:

[Image: search results listing the matching filenames and page numbers]

Is there a way to extract the specific article on the specific page using this data? I now know which files to check and which pages.

Thanks a lot.

Mtrinidad

1 Answer


There is a library called tika that extracts text from PDF files. It parses a whole file at a time, so first split your PDF so that only the page in question remains. Then you can use:

from tika import parser
# Parse the single-page PDF and print the text tika extracted from it
parsed_page = parser.from_file('sample.pdf')
print(parsed_page['content'])

NOTE: This library requires Java to be installed on the system.
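The splitting step is not shown above, so here is a minimal sketch of one way it might be done. It assumes PyPDF2 3.x (the `PdfReader` / `PdfWriter` API), that the search results are available as (filename, page) pairs, and that the page numbers are 1-based. The helper names `group_hits`, `write_single_page` and `page_text` are made up for illustration and are not part of either library.

```python
import os
import tempfile
from collections import defaultdict


def group_hits(hits):
    """Group (filename, page) pairs into {filename: sorted unique pages}."""
    grouped = defaultdict(set)
    for filename, page in hits:
        grouped[filename].add(page)
    return {name: sorted(pages) for name, pages in grouped.items()}


def write_single_page(src_path, page_number, dst_path):
    """Copy one page of src_path into a new PDF at dst_path.

    Assumption: page_number is 1-based, as a person would count pages.
    """
    # Imported here so group_hits works even without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    writer.add_page(reader.pages[page_number - 1])  # PyPDF2 counts from 0
    with open(dst_path, "wb") as handle:
        writer.write(handle)


def page_text(src_path, page_number):
    """Return the text tika extracts from a single page of src_path."""
    from tika import parser  # needs Java installed on the system

    # Write the page to a temporary file, parse it, then clean up.
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    try:
        write_single_page(src_path, page_number, tmp_path)
        return parser.from_file(tmp_path)["content"]
    finally:
        os.remove(tmp_path)
```

With the hits grouped per file, you could loop over `group_hits(hits).items()` and call `page_text(name, page)` for each page. Note that this returns the text of the whole page; isolating one article from it would still need extra logic of your own.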

NameKhan72