I don't know your exact problem, but if you want to extract data or tables from a PDF, try the camelot-py
library. It is easy to use and gives more than 90% accuracy in most cases.
I am also working on the same kind of project.
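import camelot
# PDF_file_Path is the path to your PDF; each table_areas entry is "x1,y1,x2,y2" (top-left, bottom-right)
tables = camelot.read_pdf(PDF_file_Path, flavor='stream', pages='1', table_areas=['5,530,620,180'])
print(tables[0].parsing_report)  # accuracy, whitespace, etc. for the first table
df = tables[0].df  # the first table as a pandas DataFrame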
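The table_areas string is easy to get wrong, because camelot expects the top-left corner first and the bottom-right corner second, in PDF coordinates where y is measured from the bottom of the page. Here is a small sketch of a helper that builds the string and checks the order. The name make_table_area is my own and is not part of camelot.

```python
def make_table_area(x1, y1, x2, y2):
    """Build a camelot-style area string "x1,y1,x2,y2".

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner.
    PDF coordinates start at the bottom-left of the page, so the top edge
    has the larger y value.
    """
    if x1 >= x2:
        raise ValueError("x1 (left) must be smaller than x2 (right)")
    if y1 <= y2:
        raise ValueError("y1 (top) must be larger than y2 (bottom)")
    return f"{x1},{y1},{x2},{y2}"


# Hypothetical helper, shown with the same numbers as the call above.
area = make_table_area(5, 530, 620, 180)
print(area)  # 5,530,620,180
```

You would then pass it as table_areas=[area] to camelot.read_pdf.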
The parameters of camelot.read_pdf are:
PDF_file_Path: the path to the PDF file;
table_areas: optional; if you want one exact table, provide its location, otherwise camelot tries to extract all the data and tables on the page;
pages: the page numbers to parse, given as a string (e.g. '1', '1,3,4' or 'all').
.parsing_report shows a description of the result, e.g., accuracy and whitespace.
.df gives the table as a data frame. Index 0 refers to the 1st table; which index you need depends on your data.
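If a page yields more than one table, you probably don't want to check each parsing_report by hand. A rough sketch of how to keep only the tables above some accuracy threshold is below. The function name and the 90.0 default are my own choices, and the demo uses stand-in objects instead of a real PDF so that it runs without camelot installed.

```python
from types import SimpleNamespace


def tables_above_accuracy(tables, min_accuracy=90.0):
    """Return (index, table) pairs whose reported accuracy is >= min_accuracy.

    Each table is expected to expose .parsing_report as a dict with an
    'accuracy' key, which is what camelot tables provide.
    """
    kept = []
    for i, table in enumerate(tables):
        report = table.parsing_report
        if report.get("accuracy", 0) >= min_accuracy:
            kept.append((i, table))
    return kept


# Stand-in objects that only mimic .parsing_report; not real camelot tables.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.1, "whitespace": 12.5}),
    SimpleNamespace(parsing_report={"accuracy": 41.0, "whitespace": 60.0}),
    SimpleNamespace(parsing_report={"accuracy": 93.4, "whitespace": 20.0}),
]

good = tables_above_accuracy(fake_tables)
print([i for i, _ in good])  # [0, 2]
```

With real data you would pass the result of camelot.read_pdf(...) instead of fake_tables and then read .df from each table that is kept.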
You can read more about them in the camelot documentation.