
I am trying to build a PDF crawler for the annual reports of corporations - these reports are PDF documents with a lot of text and also a lot of tables.

I don't have any trouble converting the PDF into a TXT file, but my actual goal is to search for certain keywords (for example REVENUE, PROFIT) and extract the associated data, e.g. Revenue 1.000.000.000€, into a data frame.

I tried different libraries, especially tabula-py and PyPDF2, but I couldn't find a smart way to do that - if anyone can help with a strategy, it would be amazing!

Best Regards, Robin

rbnspckrs
  • Hi there, can you provide some examples of the code you have tried? – Michelle Jun 20 '20 at 06:17
  • @rbnspckrs pdfminer.six is great for this! You can use it to parse the text into boxes with coordinates. Then look for your keywords in the boxes, and for your data in the boxes close to your keywords (see the sketch below). Validate with regex. – jkortner Jun 22 '20 at 15:06
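A minimal sketch of the box-based approach suggested in the comments, assuming pdfminer.six's layout API; the filename and keyword list are placeholders to adapt:

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    KEYWORDS = ("REVENUE", "PROFIT")  # assumed search terms

    # Walk every page, collect text boxes, and flag those containing a keyword.
    for page_no, page in enumerate(extract_pages("annual_report.pdf"), start=1):
        for box in page:
            if not isinstance(box, LTTextContainer):
                continue
            text = box.get_text()
            if any(kw in text.upper() for kw in KEYWORDS):
                # box.bbox is (x0, y0, x1, y1); boxes with a similar y range
                # on the same page are candidates for the matching figure.
                print(page_no, box.bbox, text.strip())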

1 Answer


Extracting data from PDFs is tricky business. Although there are PDF standards, not all PDFs are created equal. If you can already extract the data you need in text form, you can use RegEx to pull out the data you require.
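For example, if the extracted text contains lines like "Revenue 1.000.000.000€", a short sketch of the RegEx route could look like this (the keyword list and the European number format are assumptions to adapt to your reports):

    import re
    import pandas as pd

    KEYWORDS = ["Revenue", "Profit"]   # assumed search terms
    FIGURE = r"([\d.,]+)\s*€"          # e.g. matches "1.000.000.000€"

    def extract_figures(text, keywords=KEYWORDS):
        rows = []
        for kw in keywords:
            # Keyword, then any non-digit characters, then a figure.
            match = re.search(rf"{kw}\D*{FIGURE}", text, flags=re.IGNORECASE)
            if match:
                rows.append({"keyword": kw, "value": match.group(1)})
        return pd.DataFrame(rows)

    sample = "Revenue 1.000.000.000€ ... Profit 250.000.000€"
    print(extract_figures(sample))

The fragility here is the trade-off: any change in wording or number formatting in the PDFs will break the pattern.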

Amazon has a machine learning tool called Textract which you can use alongside their boto3 SDK in Python. However, it is a pay-per-use service. The main difference between Textract and regular expressions is that Textract can recognise and format data pairs and tables, which should mean that creating your 'crawler' is quicker and less prone to breaking if your PDFs change going forward.
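A minimal sketch of the boto3 route, assuming a single-page document passed as bytes (multi-page PDFs have to go through the asynchronous start_document_analysis API with the file in S3); the filename is a placeholder:

    import boto3

    client = boto3.client("textract")

    # Synchronous analysis of a single-page document.
    with open("report_page1.pdf", "rb") as f:  # placeholder filename
        response = client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["TABLES", "FORMS"],  # ask for tables and key-value pairs
        )

    # The response is a flat list of "Block" dicts; print the detected text lines.
    for block in response["Blocks"]:
        if block["BlockType"] == "LINE":
            print(block["Text"])

Tables and key-value pairs come back as TABLE/CELL and KEY_VALUE_SET blocks linked by IDs, which takes a bit more code to stitch together.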

There is also a Python package called textract, but it's not the same as the one provided by AWS; rather, it's a wrapper that (for PDFs) uses pdftotext (the default) or pdfminer.six. It's worth checking out, as it may yield your data in a better format.
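A quick way to try it and compare the two backends (the filename is a placeholder; textract returns bytes):

    import textract

    # Default backend for PDFs is pdftotext...
    text_default = textract.process("annual_report.pdf")

    # ...but you can switch to pdfminer.six and compare the output.
    text_pdfminer = textract.process("annual_report.pdf", method="pdfminer")

    print(text_default.decode("utf-8")[:500])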

Lucan