How can I extract only one table from a PDF that contains multiple tables? I have tried Amazon Textract, but it returns every table in the PDF as CSV. I need to extract only certain tables, based on conditions such as the text they contain or their bounding box dimensions.
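To make the requirement concrete, here is a minimal sketch of the kind of filtering I have in mind. It assumes the input is the JSON response of a Textract table analysis already loaded as a Python dict (a `Blocks` list whose `TABLE` blocks carry a normalized `Geometry.BoundingBox` and `CHILD` relationships to `CELL` and `WORD` blocks). The function names and the `keyword`, `min_top`, `max_top`, and `page` parameters are my own placeholders, not part of any library.

```python
def child_ids(block):
    """Return the ids listed under a block's CHILD relationships."""
    ids = []
    for rel in block.get("Relationships") or []:
        if rel.get("Type") == "CHILD":
            ids.extend(rel.get("Ids", []))
    return ids


def table_text(table, blocks_by_id):
    """Join the words of a TABLE block by walking TABLE -> CELL -> WORD."""
    words = []
    for cell_id in child_ids(table):
        cell = blocks_by_id.get(cell_id)
        if cell is None:
            continue
        for word_id in child_ids(cell):
            word = blocks_by_id.get(word_id)
            if word is not None and word.get("BlockType") == "WORD":
                words.append(word.get("Text", ""))
    return " ".join(words)


def select_tables(response, keyword=None, min_top=0.0, max_top=1.0, page=None):
    """Keep only TABLE blocks that match the given conditions.

    keyword          -- hypothetical filter: case-insensitive text the table must contain
    min_top, max_top -- hypothetical filter: allowed range for the table's top edge,
                        as a fraction of page height (0.0 = top, 1.0 = bottom)
    page             -- hypothetical filter: page number the table must be on
    """
    blocks = response["Blocks"]
    blocks_by_id = {b["Id"]: b for b in blocks}
    selected = []
    for block in blocks:
        if block.get("BlockType") != "TABLE":
            continue
        # Single-page responses may omit "Page", so default to page 1.
        if page is not None and block.get("Page", 1) != page:
            continue
        top = block["Geometry"]["BoundingBox"]["Top"]
        if not (min_top <= top <= max_top):
            continue
        if keyword and keyword.lower() not in table_text(block, blocks_by_id).lower():
            continue
        selected.append(block)
    return selected
```

For example, `select_tables(response, keyword="total", max_top=0.5)` would keep only the tables that start in the upper half of a page and contain the word "total"; only those tables would then be written to CSV.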
Apart from the paid tool, these are the other libraries I have tried:
- PyPDF2
- Textract
- Tika
- pdfplumber
- pdfminer
- pdftotext
- PyMuPDF – bounding box technique
- Tabula
The problem shows up when I process multiple PDFs: some open-source libraries can read a PDF and return its text, but not in a structured format. Sometimes they cannot read the text at all because the PDF is a scanned, image-only document.
So I decided to use Amazon Textract. Please let me know if you can recommend any other libraries or paid tools that work better than Amazon Textract.