I am trying to extract only the core text from a "rich" PDF document, meaning one that contains many tables, graphs, boxes, footers, etc. that I am not interested in.
I tried some common Python packages such as PyPDF2, pdfplumber and pdfreader. The problem is that they apparently extract all the text present in the PDF, including the parts listed above that I do not want.
As an example:
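from PyPDF2 import PdfReader

pdf_path = "document.pdf"  # placeholder path to the PDF file
reader = PdfReader(pdf_path)
page = reader.pages[10]  # pages are zero-indexed, so this is page 11
text = page.extract_text()  # returns every piece of text found on the page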
This code returns all the text on page 11, including footers, boxes, table contents and the page number, whereas I would like only the core text.
Unfortunately, the only solution I have found so far is to copy and paste the core text into another file by hand.
Is there any method or package that can automatically distinguish the main text from the other parts of the PDF and return only the main text?
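To make the request concrete, here is a minimal sketch of the kind of filtering meant by "core text". It assumes word boxes shaped like the dictionaries returned by pdfplumber's extract_words() (keys text, x0, x1, top, bottom) and drops words that fall in a header or footer band or inside a detected table. The function names, the 50-point margins and the choice to treat table boxes as non-core are assumptions for illustration only; graphs and free-floating boxes would need extra handling.

```python
def core_words(words, page_height, table_bboxes=(), header=50, footer=50):
    """Keep words outside the header/footer bands and outside table boxes.

    words: dicts with "text", "x0", "x1", "top", "bottom" (PDF points,
           measured from the top-left corner of the page).
    table_bboxes: iterable of (x0, top, x1, bottom) rectangles.
    header / footer: band heights in points (assumed values, tune per document).
    """
    kept = []
    for w in words:
        # Skip anything that reaches into the top or bottom margin band.
        if w["top"] < header or w["bottom"] > page_height - footer:
            continue
        # Skip words whose centre lies inside any table rectangle.
        cx = (w["x0"] + w["x1"]) / 2
        cy = (w["top"] + w["bottom"]) / 2
        if any(x0 <= cx <= x1 and top <= cy <= bottom
               for x0, top, x1, bottom in table_bboxes):
            continue
        kept.append(w)
    return kept


def core_text(words, page_height, table_bboxes=(), header=50, footer=50):
    """Join the surviving words into one string, in their original order."""
    kept = core_words(words, page_height, table_bboxes, header, footer)
    return " ".join(w["text"] for w in kept)


def extract_core_text(path, page_index):
    """Hypothetical wrapper: feed one pdfplumber page through the filter."""
    import pdfplumber  # third-party; imported here so the filter works without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        tables = [t.bbox for t in page.find_tables()]
        return core_text(page.extract_words(), page.height, tables)
```

Something like extract_core_text("document.pdf", 10) would then be the desired replacement for page.extract_text() in the example above, although fixed margins are unlikely to work for every layout.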
Thank you for your help!!!