2

I am trying to scrape data from a PDF so that I can reformat it and then insert it to a table in Oracle. I am trying to use Tabula to read the PDF and convert it to a list of tables, but Tabula seems to be dropping columns from tables if those columns only hold null values. Normally this wouldn't be an issue (the data is 'None' to begin with, so I don't care about preserving it), but dropping the 'null' values on certain columns but not on others makes it impossible for my code to identify which columns are which. Eg, it might go from:

0   1   2   3

x   x  n/a  x

x   x  n/a  x

x   x  n/a  x

to

0   1   2

x   x   x

x   x   x

x   x   x

There is no way to know during runtime which column is being dropped, so I can't just re-insert it to the necessary place.

The columns do not have any unique identifiers in the data. I can't just add a null column at the end because it is absolutely necessary that I keep the same ordering in the columns.

I have investigated the Tabula API, and while I found a number of handy guides for how to DROP null columns, I found nothing for ensuring that they stay present.

dflist = tabula.read_pdf(path, pages = '14-27', multiple_tables = True)
# dflist is a list of dataframes
# dflist[0] == a single dataframe

(Apologies for poor formatting; unfamiliar with stack overflow spacing)

Expected results:

0   1   2   3

X   NaN X   X   

X   NaN X   X   

X   NaN X   NaN

Actual results:

0   1   2

X   X   X   

X   X   X   

X   X   NaN
BENY
  • 317,841
  • 20
  • 164
  • 234
NicholasTW
  • 85
  • 1
  • 10

3 Answers3

1

UPDATE: The best solution I could find was fiddling with the 'lattice' settings which determined how tables are read in Tabula (you can find documentation on their site). Unfortunately, these settings also offset some of the rows on my PDF, so I couldn't use it. I had to give up the idea of making it entirely automated, and now use a staging table where a human checks which columns will be dropped.

NicholasTW
  • 85
  • 1
  • 10
1

I had the same problem, looked also at other examples of this on GitHub. I think that whenever you are avoiding reading the headers with

pandas_options={'header': None}

and there are empty columns, this problem will manifest. I tested having it reading the headers, but the titles of the columns in my pdf were messy in their own way so that was not really an option. Doesn't seem right when you set the 'columns' and 'area' parameters that this problem should still happen even without the headers. You tell it exactly where the columns are and it still drops them if they are empty. It's probably a deficit of tabula - there really is nothing for ensuring that they stay present.

I was also about to despair and do it manually then I came up with a workaround.

So like the OP I was also reading multiple tables, actually hundreds of pdf pages. The normal number of columns was in my case 9, but if it was not 9, it would predictably be 8. Also, it only ever happened with the 9th column, in the original data, that was the only column which on some pages was completely empty. So for the dataframes in the list, just tacked on a column of NaN's where tabula failed to put one:

dfs_clean = list()

for df in dflist:
    if df.shape[1] != 9:
        df[8] = np.nan 
        dfs_clean.append(df)       
        continue        
    dfs_clean.append(df)

one_df = pd.concat(dfs_clean)

This only works if it's always the same predictable column which is sometimes is empty, so you can programatically replace it with nans.

It would be nicer if the maintainers could either add this feature to tabula, or tell us how to use it if it already exists. The Python interface has a java_options parameter, maybe there is something you can put in there which would do it.

cardamom
  • 6,873
  • 11
  • 48
  • 102
1
  1. If possible, get the lattice version of the specific table you're trying to scrape (i.e. a table with grid lines separating each cell, like a PDF of an Excel spreadsheet). Then, if you do have the lattice version, try setting lattice=True in read_pdf().
  2. If there's no lattice version and you're stuck with a table with empty space between cells, you will have to define the point coordinates of the table's column boundaries so that Tabula will be able to differentiate between empty space and an empty column.
  3. Get the point coordinates of the table's column boundaries by checking out https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want. There are two possible ways to get it, but what you'll probably do is download the Tabula app, open it, and upload your PDF there.
  4. Instead of selecting the entire table area as written in Step 3 from the link, select each column one-by-one, and follow Steps 4 to 5 as usual.
  5. The generated script will contain 4 coordinates. Get the x-coordinates of the column (2nd and 4th) and take note of it for later. Repeat for every single column, until you have x-coordinates for every column.

For example:

0   1   2   3
    
X   NaN X   X   
    
X   NaN X   X   
    
X   NaN X   NaN
  • Column 0's x-coordinates: 0, 10
  • Column 1's x-coordinates: 10, 20
  • Column 2's x-coordinates: 20, 30
  • Column 3's x-coordinates: 30, 40
  1. Read https://tabula-py.readthedocs.io/en/latest/faq.html?highlight=options#can-i-use-option-xxx. The shown example uses the column option, which is documented in https://github.com/tabulapdf/tabula-java/blob/master/README.md as follows:

-c,--columns <COLUMNS> x coordinates of column boundaries. Example --columns 10.1,20.2,30.3. If all values are between 0-100 (inclusive) and preceded by '%', input will be taken as % of actual width of the page. Example: --columns %25,50,80.6

  1. Write the columns option accordingly using the x-coordinates of the columns you've recorded earlier, and pass it into the options arg in read_pdf().

For example:

0   1   2   3
    
X   NaN X   X   
    
X   NaN X   X   
    
X   NaN X   NaN
  • Column 0's x-coordinates: 0, 10
  • Column 1's x-coordinates: 10, 20
  • Column 2's x-coordinates: 20, 30
  • Column 3's x-coordinates: 30, 40
col_boundaries = '--columns 10,20,30'
dflist = tabula.read_pdf(path, pages = '14-27', options=col_boundaries)

Note: This only works if all pages have their columns at exactly the same place. For pages with different columns or layouts, you'll have to run read_pdf() using their specific boundaries as well.

Weepy
  • 11
  • 3