I am trying to extract table data from a PDF using the Camelot (camelot-py) library. I started with the stream flavor, like this:
import camelot
tables = camelot.read_pdf('sample.pdf', flavor='stream', pages='1', columns=['110,400'], split_text=True, row_tol=10)
tables.export('ipc_export.csv', f='csv', compress=True)
tables[0]
tables[0].parsing_report
tables[0].to_csv('ipc_export.csv')
tables[0].df
However, I could not get the desired outcome, even after adjusting the columns value. I then switched to the lattice flavor. It now detects the columns accurately, but because the source PDF does not separate rows with ruling lines, the entire table content is extracted into a single row.
Here is the lattice version:
import camelot
tables = camelot.read_pdf('sample_camelot_extract.pdf', flavor='lattice', pages='1')
tables.export('ipc_export.csv', f='csv', compress=True)
tables[0]
tables[0].parsing_report
tables[0].to_csv('ipc_export.csv')
tables[0].df
The logic I want to implement is this: each new piece of text in the first column (FIG ITEM) should mark the start of a new row, and any lines without text in that column should belong to the row above.
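To make the intended logic concrete, here is a rough sketch of what I have in mind, written with pandas only. It assumes the stream output yields one DataFrame row per text line, with the first column left empty on continuation lines. The function name merge_continuation_rows, the sample values, and the column names other than FIG ITEM are made up for illustration and do not come from the actual PDF.

```python
import pandas as pd


def merge_continuation_rows(df, key_col=0, sep=" "):
    """Fold rows with an empty key column into the previous keyed row.

    Hypothetical helper: key_col is the position of the column whose
    non-empty cells mark the start of a new logical row.
    """
    merged = []
    for row in df.itertuples(index=False, name=None):
        # Normalise every cell to a stripped string (NaN/None become "").
        cells = ["" if pd.isna(v) else str(v).strip() for v in row]
        if cells[key_col] or not merged:
            # Non-empty key (or very first line): start a new logical row.
            merged.append(cells)
        else:
            # Empty key: append each non-empty cell to the row above.
            merged[-1] = [
                sep.join(part for part in (old, new) if part)
                for old, new in zip(merged[-1], cells)
            ]
    return pd.DataFrame(merged, columns=df.columns)


# Made-up sample that mimics line-by-line stream output.
raw = pd.DataFrame(
    [
        ["1", "PART-001", "BRACKET ASSY,"],
        ["", "", "LEFT HAND"],
        ["2", "PART-002", "BOLT"],
        ["", "", "(ALTERNATE)"],
    ],
    columns=["FIG ITEM", "PART NUMBER", "NOMENCLATURE"],
)

result = merge_continuation_rows(raw)
print(result)
```

With Camelot this would presumably be called as merge_continuation_rows(tables[0].df) on the stream result, possibly after dropping header lines first, but I have not confirmed that the stream output is clean enough for this to work on the real file.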
I have tried both flavors but am not sure which is the better approach.
Link for original file here:
Thank you.