How to extract a table from a PDF without manually tweaking the parameters?

Asked Mar 27 '23 at 13:43

I know that the packages camelot and tabula-py can read tables from a PDF file. The problem is that every PDF file is different, so the parameter settings that work for one file do not work for another. Since my preprocessing pipeline needs to be automated, I cannot tweak the settings for each file by hand.

For example, for the following file I can extract the table after tweaking: https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf

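import camelot

# Stream flavor with hand-tuned tolerances; these values suit this file only.
tables = camelot.read_pdf('table.pdf', flavor='stream', row_tol=20, edge_tol=20, strip_text='\n')
print(tables[0].parsing_report)  # dict with accuracy, whitespace, order, page
print(tables[0].df)  # the extracted table as a pandas DataFrame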

However, these settings do not work for other files. I would appreciate advice on how to make this work for any PDF file without manual tweaking. Thank you very much in advance!
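To make the goal concrete, one possible direction is to try a small grid of settings per file and keep whichever result scores best on camelot's own parsing_report. The following is only a minimal sketch under assumptions: the helper names (candidate_settings, score, best_table), the tolerance grid, and the "accuracy minus whitespace" score are all hypothetical choices made for illustration, and a high reported accuracy does not guarantee that the extracted table is actually correct.

```python
from itertools import product


def candidate_settings():
    """Yield keyword-argument dicts to try: lattice first, then stream variants.

    The tolerance grid is an arbitrary assumption, not a recommended set.
    """
    yield {"flavor": "lattice"}
    for row_tol, edge_tol in product((2, 10, 20), (50, 200, 500)):
        yield {"flavor": "stream", "row_tol": row_tol, "edge_tol": edge_tol}


def score(report):
    """Hypothetical heuristic: reward accuracy, penalize empty cells."""
    return report["accuracy"] - report["whitespace"]


def best_table(path, pages="1"):
    """Return (score, settings, table) for the best-scoring attempt, or None."""
    import camelot  # imported lazily so the helpers above work without camelot

    best = None
    for kwargs in candidate_settings():
        try:
            tables = camelot.read_pdf(path, pages=pages, strip_text="\n", **kwargs)
        except Exception:
            # Some settings can fail on some files; skip them and move on.
            continue
        for table in tables:
            current = score(table.parsing_report)
            if best is None or current > best[0]:
                best = (current, kwargs, table)
    return best
```

Calling best_table('table.pdf') would then return the winning settings together with the table, so that the chosen parameters can be logged per file. Whether such a score picks the right table across very different PDFs is exactly the open question here.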

Ruthger Righart
