I am trying to build a PDF crawler for corporate annual reports. These reports are PDF documents that contain a lot of text and also a lot of tables.
Converting the PDF to plain text is not a problem. My actual goal is to search for certain keywords (for example REVENUE, PROFIT) and extract the matching figures, such as "Revenue 1.000.000.000€", into a data frame.
I have tried several libraries, in particular tabula-py and PyPDF2, but I couldn't find a good way to do this. Can anyone suggest a strategy? Any help would be much appreciated!
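
To make the goal concrete, here is a rough sketch of the kind of result I am after. It assumes the text has already been extracted, that each figure sits on the same line as its keyword, and that numbers use European formatting ("." for thousands, "," for decimals). The sample text and the `extract_figures` helper are made up for illustration, and this naive approach may well break on tables where the label and the value end up on different lines.

```python
import re

# Hypothetical sample standing in for text extracted from a report.
SAMPLE_TEXT = """\
Annual Report
Revenue 1.000.000.000€
Profit 250.000.000€
"""

# Assumption: European formatting, "." groups thousands and "," marks decimals.
# The (?!\d) guard rejects plain digit runs such as a year ("2023").
AMOUNT = r"(\d{1,3}(?:\.\d{3})*(?:,\d+)?)(?!\d)\s*(€)?"


def extract_figures(text, keywords):
    """Return one row per keyword occurrence that is followed by an amount."""
    rows = []
    for keyword in keywords:
        # Keyword as a whole word, then optional spaces/colon, then the amount.
        pattern = re.compile(
            rf"\b(?:{re.escape(keyword)})\b[ \t:]*{AMOUNT}",
            re.IGNORECASE,
        )
        for match in pattern.finditer(text):
            raw, currency = match.groups()
            # "1.000.000.000" -> 1000000000.0, "1.234,5" -> 1234.5
            value = float(raw.replace(".", "").replace(",", "."))
            rows.append(
                {"keyword": keyword, "value": value, "currency": currency or ""}
            )
    return rows


rows = extract_figures(SAMPLE_TEXT, ["Revenue", "Profit"])
```

The resulting list of dictionaries can be passed straight to `pandas.DataFrame(rows)` to get one row per match with `keyword`, `value` and `currency` columns.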
Best Regards, Robin