I know that the packages camelot and tabula-py can read tables from a PDF file. The problem is that every PDF file is different, so the parameter settings that work for one file do not work for another. Since my preprocessing pipeline needs to be automated, I cannot tweak the settings for each file by hand.
For example, I can extract the table from https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf after some tweaking:
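import camelot
import pandas as pd

# Settings tuned by hand for this particular file (table.pdf saved locally)
tables = camelot.read_pdf('table.pdf', flavor='stream', row_tol=20, edge_tol=20, strip_text='\n')
# Quality metrics reported by camelot for the first detected table
print(tables[0].parsing_report)
# The extracted table as a pandas DataFrame
tables[0].df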
However, these settings do not work for other files. I would appreciate advice on how to make this work for arbitrary PDF files without manual tweaking. Thank you very much in advance!
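To make the goal more concrete, here is a minimal sketch of the kind of automation I have in mind: try a small list of candidate settings for each file and keep the result that scores best on camelot's own parsing_report. The candidate list, the scoring heuristic (accuracy minus whitespace), and the helper names score and best_tables are assumptions for illustration only, not something camelot provides, and I do not know whether this generalizes to arbitrary PDFs.

```python
# Hypothetical candidate settings; the values are illustrative, not tuned.
CANDIDATES = [
    {"flavor": "lattice"},
    {"flavor": "stream"},
    {"flavor": "stream", "row_tol": 20, "edge_tol": 20},
]


def score(report):
    """Assumed heuristic: reward accuracy, penalize empty cells."""
    return report.get("accuracy", 0.0) - report.get("whitespace", 0.0)


def best_tables(path, candidates=CANDIDATES, read=None):
    """Return (tables, kwargs) for the best-scoring setting, or (None, None)."""
    if read is None:
        import camelot  # imported lazily so the helper can be tested without it
        read = camelot.read_pdf

    best, best_kwargs, best_score = None, None, float("-inf")
    for kwargs in candidates:
        try:
            tables = read(path, **kwargs)
        except Exception:
            continue  # this setting failed on this file; try the next one
        n = len(tables)
        if n == 0:
            continue  # nothing detected with this setting
        # Average the score over all tables found with this setting.
        avg = sum(score(tables[i].parsing_report) for i in range(n)) / n
        if avg > best_score:
            best, best_kwargs, best_score = tables, kwargs, avg
    return best, best_kwargs
```

Usage would be something like tables, used = best_tables('table.pdf'), followed by tables[0].df when tables is not None. The read parameter exists only so that the selection logic can be exercised without camelot or a real PDF.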