How to detect a specific table in PDFs when each PDF has a different format?

Asked Apr 10 '23 at 06:36

I have a task where users supply many types of PDFs (the number of variations is in the hundreds), and I need to extract a table with specific characteristics from each of them. Each PDF can contain multiple tables. Another issue is that the target tables share similar characteristics, but the column names and the number of columns can differ. Tables may or may not have borders. Essentially everything is variable, and I am stuck on the approach. I can extract all the tables with camelot, but I am not sure how to pick out the specific table I want.

Note: I have built a model with LangChain and GPT-3.5 that does the job, but I need to develop an in-house solution. I am not expecting code help; I would appreciate help with the approach. Thanks.

I tried camelot and, after experimenting with its advanced parameters, I can get the data out, but for all of the tables in a document. I am stuck on how to select the one specific table.
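To make the "select the specific table" step concrete, here is a minimal sketch of one possible framing: score each extracted table by how closely its header row fuzzy-matches a vocabulary of expected column names, then keep the best one. Everything specific here is an assumption for illustration: the `EXPECTED_HEADERS` vocabulary, the `0.6` threshold, the helper names, and the `input.pdf` path are hypothetical, and the fuzzy matching uses Python's standard `difflib` rather than anything camelot provides.

```python
from difflib import SequenceMatcher

# Hypothetical vocabulary: names expected in the target table's header.
EXPECTED_HEADERS = ("item", "quantity", "unit price", "amount")


def similarity(a, b):
    """Fuzzy string similarity in [0, 1], ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def header_score(rows, expected=EXPECTED_HEADERS, max_header_rows=3):
    """Best average match between the expected names and any of the first rows.

    `rows` is a table as a list of rows, each row a list of cell strings.
    Several leading rows are checked because the header is not always row 0.
    """
    best = 0.0
    for row in rows[:max_header_rows]:
        cells = [c for c in row if c.strip()]
        if not cells:
            continue
        # For every expected name, take its closest cell, then average.
        score = sum(max(similarity(e, c) for c in cells) for e in expected)
        best = max(best, score / len(expected))
    return best


def pick_table(tables, threshold=0.6):
    """Return the index of the best-scoring table, or None if none qualifies."""
    scores = [header_score(t) for t in tables]
    if not scores:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None


# With camelot, tables could be converted to rows of strings like this
# (both flavors, since tables may or may not have borders):
#   import camelot
#   tables = [t.df.values.tolist()
#             for flavor in ("lattice", "stream")
#             for t in camelot.read_pdf("input.pdf", pages="all", flavor=flavor)]
```

Because the score averages over the expected names rather than over the table's columns, extra or renamed columns lower the score gradually instead of ruling a table out, which fits the case where column names and counts vary between PDFs.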

asked Apr 10 '23 at 06:36

siddharth patel

I am not sure if I fully understand what you want. Can you maybe share your langchain code and provide a synthetic pdf (as an example)? – cronoik Apr 10 '23 at 21:08

0 Answers