I am trying to scrape data from a PDF so that I can reformat it and insert it into a table in Oracle. I am using Tabula to read the PDF and convert it to a list of tables, but Tabula seems to drop columns from a table when those columns hold only null values. Normally this wouldn't be an issue (the data is 'None' to begin with, so I don't care about preserving it), but dropping some null columns and not others makes it impossible for my code to identify which column is which. E.g., a table might go from:
0 1 2 3
x x n/a x
x x n/a x
x x n/a x
to
0 1 2
x x x
x x x
x x x
There is no way to know at runtime which column was dropped, so I can't just re-insert it in the right place.
The columns have no unique identifiers in the data, and I can't just add a null column at the end because the column order must stay exactly the same.
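To make the ambiguity concrete, here is a minimal sketch in plain Python (no Tabula involved). The helper name `candidate_layouts` is hypothetical, and the expected width of 4 is an assumption taken from the example above. It lists every position where a single dropped column could be re-inserted into a 3-value row:

```python
def candidate_layouts(row, expected_width, fill=None):
    """Return every way to pad `row` back to `expected_width` by
    re-inserting one missing column filled with `fill`."""
    row = list(row)
    if len(row) != expected_width - 1:
        # Assumption: this sketch only covers exactly one dropped column.
        raise ValueError("sketch only handles exactly one dropped column")
    # The missing column could sit at any index from 0 to expected_width - 1.
    return [row[:i] + [fill] + row[i:] for i in range(expected_width)]


# A row as it looks after one column has been dropped.
observed = ["x", "x", "x"]

for layout in candidate_layouts(observed, expected_width=4):
    print(layout)
```

Every candidate is equally plausible, since nothing in the remaining values says where the gap was.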
I have looked through the Tabula API, and while I found a number of handy guides on how to DROP null columns, I found nothing on making sure they stay present.
import tabula

# Read pages 14-27; with multiple_tables=True the result is a list of DataFrames
dflist = tabula.read_pdf(path, pages='14-27', multiple_tables=True)
# dflist[0] is a single DataFrame
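As a rough diagnostic, the affected tables can at least be flagged by their width. This is a minimal sketch, assuming every table on these pages should have the same number of columns (4 in the example above). `find_narrow_tables` is a hypothetical helper, not part of Tabula, and it relies only on the `.shape` attribute of each DataFrame. It reports which tables are short, not which column is missing:

```python
def find_narrow_tables(frames, expected_width):
    """Return (index, actual_width) for every frame that has fewer
    columns than expected_width."""
    return [
        (i, df.shape[1])  # shape is (rows, columns)
        for i, df in enumerate(frames)
        if df.shape[1] < expected_width
    ]
```

For example, `find_narrow_tables(dflist, expected_width=4)` would list the positions in `dflist` where a column appears to have been lost.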
(Apologies for the poor formatting; I'm unfamiliar with Stack Overflow spacing.)
Expected results:
0 1 2 3
X NaN X X
X NaN X X
X NaN X NaN
Actual results:
0 1 2
X X X
X X X
X X NaN