I have a company's annual report (in .pdf format) and I want to extract the balance sheet and other related statements from it using Python. I tried the PyPDF2 library, but the text it extracts is highly unstructured. Is there a better way?

PRAYANK
  • The answer is: "there is always a way". Now, can you be more specific and add a sample of your data and a snippet of code so we can see what you've tried and what went wrong? – Azat Ibrakov Sep 01 '18 at 05:11
  • The data is at https://s3-ap-southeast-1.amazonaws.com/bsy/iportal/images/Annual-report-2017-18_324BCC06D8C6765F2F6C750DD9CD8C63.pdf and I want to fetch the balance sheet, which can appear on any page. – PRAYANK Sep 04 '18 at 06:09

2 Answers

You should use textract

https://github.com/deanmalmgren/textract

It supports various file types for text extraction.
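
As a rough sketch of how that could look for this question: `textract.process` returns the extracted text as bytes, which can then be searched for the statement heading. The helper names `extract_text` and `pages_mentioning` are made up for illustration, and the page splitting assumes the extracted text separates pages with form-feed characters (typical of pdftotext-style output, but worth checking on your own file).

```python
import re


def extract_text(pdf_path):
    """Return the text of a PDF as a str, using textract."""
    # Third-party dependency; imported here so the helper below can be
    # used on already-extracted text without textract installed.
    import textract

    # textract.process returns bytes, so decode before searching.
    return textract.process(pdf_path).decode("utf-8", errors="replace")


def pages_mentioning(text, phrase="balance sheet"):
    """Return the 1-based numbers of pages whose text contains `phrase`."""
    # Allow any run of whitespace between the words of the phrase, because
    # extracted PDF text often has extra spaces or line breaks.
    words = [re.escape(word) for word in phrase.split()]
    pattern = re.compile(r"\s+".join(words), re.IGNORECASE)

    # Assumption: pages are separated by form-feed characters.
    pages = text.split("\f")
    return [n for n, page in enumerate(pages, start=1) if pattern.search(page)]
```

Something like `pages_mentioning(extract_text("annual_report.pdf"))` (the file name is a placeholder) would then give candidate pages. Expect false positives, since the contents page and the notes usually mention the balance sheet as well.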

Richard Rublev

Your question is not very clear, but I understand it because I've done a lot of work on extracting data from UK annual reports. To explain to others: what you're asking for sounds straightforward, but in reality it's a nightmare. Annual reports come in PDF format, and the firms producing them don't follow any common standard, which makes these reports difficult to analyse even manually. PDFs lose their structure when you convert them to text. I have a Java tool that reads and detects the structure of UK PDF annual reports (similar to the one you provided in the link). It took me 5 years to come up with a solution that can process up to 95% of all UK annual reports despite the huge differences between them. Have a look: https://github.com/drelhaj/CFIE-FRSE (there are links there to papers on how we did it).
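
To give a feel for what "losing structure" means in practice, here is a minimal Python sketch that tries to turn extracted text lines back into (label, values) rows. It is not how CFIE-FRSE works; it is only a naive heuristic, and the names `ROW`, `to_number` and `parse_rows` are made up for this example. It assumes each table row survives extraction as a single line with the label first and the figures last, which often does not hold for real reports.

```python
import re

# A figure such as 1,234 or 1,234.56, optionally wrapped in parentheses
# (a common accounting notation for negative amounts).
NUMBER = r"\(?-?\d[\d,]*(?:\.\d+)?\)?"

# A row is a text label followed by one or more figures at the end of the line.
ROW = re.compile(rf"^(?P<label>.*?\S)\s+(?P<values>{NUMBER}(?:\s+{NUMBER})*)\s*$")


def to_number(token):
    """Convert a figure like '1,234' or '(250)' to a float."""
    negative = token.startswith("(") or token.startswith("-")
    value = float(token.strip("()-").replace(",", ""))
    return -value if negative else value


def parse_rows(text):
    """Return (label, [values]) for every line that looks like a table row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            values = [to_number(token) for token in match.group("values").split()]
            rows.append((match.group("label"), values))
    return rows
```

Lines without trailing figures (headings, narrative text) are skipped. Multi-line labels, note references, and columns that were extracted in the wrong order will all break this, which is exactly why a general solution takes so much work.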

Eric Aya
PhDeveloper