I've come across an assignment that requires me to extract tabular data from images in a PDF file into neatly formatted dataframes using Python code. There are several files to process, and the relevant pages may have different page numbers in each file, so the sequence of steps for this problem (my assumption) is:

  1. Navigate to relevant section of the pdf
  2. Extract images of the tabular data
  3. Extract data from the images, format and convert to dataframes.
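For illustration, the three steps above could be sketched roughly as follows. This is only a sketch under assumptions: it assumes PyMuPDF (imported as `fitz`) for steps 1 and 2, and it assumes some OCR engine has already produced `(text, x, y)` word positions for step 3. The helper names `find_pages`, `extract_page_images` and `words_to_rows`, the `keyword` search and the `row_tol` tolerance are all hypothetical, and the row grouping assumes one word per cell.

```python
def find_pages(pdf_path, keyword):
    """Step 1: return 0-based page numbers whose text contains `keyword`."""
    import fitz  # PyMuPDF (assumed); imported lazily so step 3 runs without it

    with fitz.open(pdf_path) as doc:
        return [
            number
            for number, page in enumerate(doc)
            if keyword.lower() in page.get_text().lower()
        ]


def extract_page_images(pdf_path, page_number):
    """Step 2: return the raw bytes of every image embedded on one page."""
    import fitz  # PyMuPDF (assumed)

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        # The first item of each image entry is the xref of the image object.
        return [
            doc.extract_image(image[0])["image"]
            for image in page.get_images(full=True)
        ]


def words_to_rows(words, row_tol=10):
    """Step 3: group OCR words into table rows.

    `words` is an iterable of (text, x, y) tuples, assumed to come from an
    OCR engine. Words whose y differs from the first word of the current row
    by at most `row_tol` pixels share that row; each row is ordered by x.
    """
    rows = []
    for text, x, y in sorted(words, key=lambda word: word[2]):
        if rows and abs(y - rows[-1]["y"]) <= row_tol:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Hypothetical OCR output for a two-column table, in arbitrary order.
sample_words = [("Qty", 120, 7), ("3", 118, 43), ("Name", 10, 5), ("Apple", 12, 40)]
table_rows = words_to_rows(sample_words)
```

The resulting list of rows can then be turned into a dataframe with `pandas.DataFrame(table_rows[1:], columns=table_rows[0])`, assuming the first row holds the headers.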

Some Google searches turned up libraries for PDF text extraction, table extraction and more, but only modular solutions, nothing end to end.

I would appreciate some help in this regard. What packages should I use? Is my approach correct? Can I get references to any helpful code snippets for similar problems?

page structure of the required tables

maaza
  • What is the application? This is the stuff of commercial services - you can build or buy. You do some image correction, OCR, cleaning up/error correction. You can also try Azure Form Recognizer service or the AWS equivalent. – jtlz2 Dec 15 '21 at 09:57
  • Please provide enough code so others can better understand or reproduce the problem. – Community Dec 21 '21 at 19:42

1 Answer

This started as a comment. I believe the answer is valid as it is in no way an endorsement of the service. I don't even use it. I know Azure uses SO as well.

This is the stuff of commercial services. You can try Azure Form Recognizer (with which I am not affiliated):

https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer

Here are some python examples of how to use it:

https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/how-to-guides/try-sdk-rest-api?pivots=programming-language-python

The AWS equivalent is Textract: https://aws.amazon.com/textract
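As a rough sketch of what using Textract could look like from Python, assuming `boto3` is installed and AWS credentials are configured: `analyze_document` with `FeatureTypes=["TABLES"]` returns a flat list of blocks, where `TABLE` blocks point to `CELL` blocks and `CELL` blocks point to `WORD` blocks through `CHILD` relationships. The helper names `analyze_image` and `textract_tables` are made up for this example, and merged cells are not handled.

```python
def analyze_image(image_bytes):
    """Send one image to Textract and ask for table analysis."""
    import boto3  # assumed installed; imported lazily so parsing works offline

    client = boto3.client("textract")
    return client.analyze_document(
        Document={"Bytes": image_bytes}, FeatureTypes=["TABLES"]
    )


def textract_tables(response):
    """Rebuild every TABLE block of a Textract response as a list of rows."""
    blocks = {block["Id"]: block for block in response["Blocks"]}

    def child_ids(block):
        # Collect the ids referenced by CHILD relationships only.
        return [
            block_id
            for relation in block.get("Relationships", [])
            if relation["Type"] == "CHILD"
            for block_id in relation["Ids"]
        ]

    tables = []
    for table in response["Blocks"]:
        if table["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(table):
            cell = blocks[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                blocks[word_id]["Text"]
                for word_id in child_ids(cell)
                if blocks[word_id]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        if not cells:
            continue
        n_rows = max(row for row, _ in cells)
        n_cols = max(col for _, col in cells)
        tables.append(
            [
                [cells.get((row, col), "") for col in range(1, n_cols + 1)]
                for row in range(1, n_rows + 1)
            ]
        )
    return tables


# Hand-written response fragment (illustrative, not real Textract output).
sample_response = {
    "Blocks": [
        {"Id": "t", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
        {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
        {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Qty"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Green"},
        {"Id": "w4", "BlockType": "WORD", "Text": "apple"},
    ]
}
sample_tables = textract_tables(sample_response)
```

Each rebuilt table is a plain list of rows, so it can be passed to `pandas.DataFrame` once you decide which row holds the headers.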

The Google Cloud version is called Form Parser - see https://cloud.google.com/document-ai/docs/processors-list#processor_form-parser

jtlz2