How to extract only a certain table from a PDF (invoice) that contains multiple tables in a structured format

Question

How can I extract only one table from a PDF that contains multiple tables? I have tried Amazon Textract, but it gives me all the tables in the PDF as CSV. I need to extract only certain tables based on conditions such as their text or bounding-box dimensions.

Apart from the paid tool, I have also tried several other libraries:

PyPDF2
Textract
Tika
pdfPlumber
pdfMiner
PDFtotext
PyMuPDF (bounding box technique)
Tabula

The problem arises when I have multiple PDFs: some open-source libraries can read a PDF and return its text, but not in a structured format. Sometimes they cannot read the text at all because the PDF is a scanned image.

So I decided to use Amazon Textract. Let me know if you have recommendations for other libraries or paid tools that work better than Amazon Textract.

Why wouldn't you just process the results and throw out what you don't want? — Jim Foye, May 04 '22 at 03:12
The LEADTOOLS [Forms Recognition SDK](https://www.leadtools.com/sdk/ocr/forms/recognition-processing) allows the recognition of structured forms by using a master form template that is created from a blank version of the invoice with fields added to define the desired data to be extracted. Specific [Table Form fields](https://www.leadtools.com/help/sdk/v22/dh/fp/tableformfield.html) can be defined in a master form to recognize designated filled tables of varying content. In addition, scanned and image PDFs can be recognized using OCR. (Disclaimer: I am an employee of the vendor) — Hussam Barouqa, May 12 '22 at 14:49

score 0 · Answer 1 · answered Mar 03 '23 at 00:34

The .csv files that you get from Amazon Textract are a post-processed version of the raw API output. You can use the API output to select what you need based on some criteria that you define.
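
As a rough sketch of what selecting from the raw output can look like: the API response contains a list of Blocks, and each table appears as a block with BlockType "TABLE" and a Geometry.BoundingBox whose Width and Height are fractions of the page. The function name select_table_blocks and the sample response below are made up for illustration; a real response would come from an AnalyzeDocument call.

```python
def select_table_blocks(response, min_width=0.5, min_height=0.5):
    """Return TABLE blocks whose bounding box exceeds the given page fractions."""
    selected = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "TABLE":
            continue  # skip PAGE, LINE, WORD, CELL, ... blocks
        box = block["Geometry"]["BoundingBox"]
        if box["Width"] > min_width and box["Height"] > min_height:
            selected.append(block)
    return selected


def _block(block_type, block_id, width, height):
    # Hypothetical helper that builds a minimal block for the example only.
    return {
        "BlockType": block_type,
        "Id": block_id,
        "Geometry": {
            "BoundingBox": {"Width": width, "Height": height, "Left": 0.0, "Top": 0.0}
        },
    }


# Hand-written stand-in for a response; the values are invented.
sample_response = {
    "Blocks": [
        _block("PAGE", "p1", 1.0, 1.0),
        _block("TABLE", "t1", 0.3, 0.1),
        _block("TABLE", "t2", 0.8, 0.6),
    ]
}

big_tables = select_table_blocks(sample_response)
print([b["Id"] for b in big_tables])
```

Only the second table passes the 50% threshold here. The thresholds are arbitrary and would need tuning for your invoices.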

Let's take the first page of your samples as an example. We use the amazon-textract-textractor package to simplify calling the API and parsing the response. Although the image is very blurry, Textract detects two tables there:

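from textractor import Textractor
from textractor.data.constants import TextractFeatures

# Uses the AWS credentials stored under the "default" profile
extractor = Textractor(profile_name="default")
# Ask Textract to analyze the page for tables only
document = extractor.analyze_document(
    file_source="./stackoverflow.png",
    features=[TextractFeatures.TABLES],
)
# Draw the detected tables on the page, without the individual words
document.visualize(with_words=False)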

Now you can filter the tables as needed. For example, here we keep a table only if its width and height are both greater than 50% of the page, and then write the first matching table to a .csv file.

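# Keep only tables spanning more than half the page in both dimensions
tables = [t for t in document.tables if t.bbox.width > 0.5 and t.bbox.height > 0.5]
if tables:  # avoid an IndexError when no table matches
    with open('output.csv', 'w') as f:
        f.write(tables[0].to_csv())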
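
Since the question also asks about filtering on text, here is a minimal sketch that combines a keyword check with the size check. It relies only on the bbox.width, bbox.height and to_csv() members used above. The helper select_tables and the stand-in tables built with SimpleNamespace are hypothetical and exist only so the example runs without calling AWS; in practice you would pass document.tables.

```python
from types import SimpleNamespace


def select_tables(tables, keyword=None, min_width=0.0, min_height=0.0):
    """Keep tables that are large enough and, optionally, contain a keyword."""
    selected = []
    for t in tables:
        if t.bbox.width <= min_width or t.bbox.height <= min_height:
            continue  # table is too small
        if keyword is not None and keyword.lower() not in t.to_csv().lower():
            continue  # keyword does not appear in the table's CSV text
        selected.append(t)
    return selected


def make_table(width, height, csv_text):
    # Stand-in exposing only the members used above; not a real Textractor table.
    return SimpleNamespace(
        bbox=SimpleNamespace(width=width, height=height),
        to_csv=lambda: csv_text,
    )


# Invented example data: a small header table and a larger line-item table.
fake_tables = [
    make_table(0.4, 0.1, "Invoice No,Date\n123,2022-05-04\n"),
    make_table(0.9, 0.6, "Item,Qty,Total\nWidget,2,10.00\n"),
]

matches = select_tables(fake_tables, keyword="total", min_width=0.5, min_height=0.5)
print(len(matches))
```

Matching on the CSV text is a simple, case-insensitive substring test; stricter conditions (for example, checking only the header row) may suit real invoices better.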
