Text Mining from PDF file using Python

Question

I have a company's annual report (in .pdf format) and I want to extract the balance sheet and other related statements from it using Python. I tried the PyPDF2 library, but the text it extracts is highly unstructured. Is there a better way?

The answer is "there is always a way". Now, can you be more specific and add a sample of your data and a snippet of code, so we can see what you've tried and what went wrong? — Azat Ibrakov, Sep 01 '18 at 05:11
The data is at https://s3-ap-southeast-1.amazonaws.com/bsy/iportal/images/Annual-report-2017-18_324BCC06D8C6765F2F6C750DD9CD8C63.pdf and I want to fetch the balance sheet, which can appear on any page. — PRAYANK, Sep 04 '18 at 06:09

score 0 · Answer 1 · answered Sep 01 '18 at 06:42

You should use textract:

https://github.com/deanmalmgren/textract

It supports various file types for text extraction.
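
A minimal sketch of how textract might be used for this, not a tested recipe for the report in the question. `textract.process` returns the extracted text as bytes; the helper names `extract_text` and `lines_mentioning` are made up for illustration, and the optional `method` argument (for example `"pdfminer"`) selects an alternative PDF backend.

```python
def extract_text(path, method=None):
    """Return the text of the document at `path` as a str, using textract."""
    import textract  # third-party: pip install textract

    # Only pass `method` when the caller asked for a specific backend.
    kwargs = {"method": method} if method else {}
    raw = textract.process(path, **kwargs)  # bytes
    return raw.decode("utf-8", errors="replace")


def lines_mentioning(text, phrase):
    """Return (line_number, line) pairs containing `phrase`, ignoring case."""
    phrase = phrase.lower()
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if phrase in line.lower()
    ]
```

Calling `lines_mentioning(extract_text("annual_report.pdf"), "balance sheet")`, where `annual_report.pdf` is a placeholder file name, would list the lines that mention the balance sheet. The extracted text is still flat, so this only helps locate the statement, not rebuild its table.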

answered Sep 01 '18 at 06:42

Richard Rublev


score 0 · Answer 2 · edited Oct 19 '18 at 08:29

Your question is not very clear, but I understand it because I've done a lot of work on extracting information from UK annual reports. To explain to others: what you're asking for sounds straightforward, but in reality it's a nightmare. Annual reports come in PDF format, and the firms producing them follow no common standard, which makes these reports difficult to analyse even manually. PDFs lose their structure when you convert them to text. I have a Java tool that reads and detects the structure of UK PDF annual reports (similar to the one you provided in the link). It took me 5 years to come up with a solution that can process up to 95% of all UK annual reports despite the huge differences between them. Have a look at https://github.com/drelhaj/CFIE-FRSE (there are links there to papers on how we did it).
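
Finding the page that holds the balance sheet is a smaller problem than recovering the report's structure. The following is a rough Python sketch of one heuristic, not the approach used in CFIE-FRSE: extract the text page by page and rank pages by how many typical balance-sheet terms they contain. The term list, the threshold, and the function names are all assumptions for illustration, and `page_texts` assumes PyPDF2 3.x, where `PdfReader` and `extract_text()` are available.

```python
# Illustrative terms only; real reports use many variants.
BALANCE_SHEET_TERMS = (
    "balance sheet",
    "total assets",
    "total liabilities",
    "equity",
    "current assets",
)


def page_texts(path):
    """Return the extracted text of each page (PyPDF2 >= 3.0 assumed)."""
    from PyPDF2 import PdfReader  # third-party: pip install PyPDF2

    reader = PdfReader(path)
    # Scanned pages may yield no text at all, so fall back to "".
    return [page.extract_text() or "" for page in reader.pages]


def score_page(text, terms=BALANCE_SHEET_TERMS):
    """Count how many terms occur in the text, ignoring case and spacing."""
    normalized = " ".join(text.lower().split())
    return sum(term in normalized for term in terms)


def rank_pages(texts, terms=BALANCE_SHEET_TERMS, min_score=2):
    """Return (page_number, score) pairs, best first, scoring >= min_score."""
    scored = [(i, score_page(t, terms)) for i, t in enumerate(texts, start=1)]
    hits = [(i, s) for i, s in scored if s >= min_score]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))
```

`rank_pages(page_texts("annual_report.pdf"))`, with a placeholder file name, would return candidate page numbers to inspect. Requiring at least two matching terms is meant to skip pages that merely mention the balance sheet in passing, such as a table of contents.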
