tabula-py not reading all rows for PDFs with alternating row colors when lattice=True

Question

I am trying to extract all rows from the PDF attached here.

Here is the code I used:

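import pandas as pd
from tabula import read_pdf

def parse_latticepdf_pages(pdf):
    # Extract ruled (lattice) tables from every page within a fixed area.
    pages = read_pdf(
        pdf,
        pages="all",
        guess=False,
        lattice=True,
        silent=True,
        area=[43, 5, 568, 774],  # top, left, bottom, right
        pandas_options={"header": None},
    )

    # read_pdf returns one DataFrame per detected table; stack them.
    return pd.concat(pages)

parse_latticepdf_pages(pdf="file.pdf")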

The output contains only the rows with a grey background; the rows with a white background are missing. How do I get all rows regardless of their background color?

Note: I initially tried stream=True, but then each line of text comes back as a separate row, and the lines cannot be grouped back into the original table rows. That is why I set lattice=True. Toggling multiple_tables on or off makes no difference.

I would appreciate any help regarding this. Thank you!

I'm not sure about this, but if the columns are fixed you can use tabula's 'columns' parameter. That way the whole table comes back in one dataframe. — Kathan Thakkar, Apr 21 '22 at 05:21

score 1 · Accepted Answer · answered Jul 29 '22 at 20:27

I finally managed to solve this. For this particular PDF layout, it is better to use another Python package such as PyMuPDF. I posted a similar question elsewhere on Stack Overflow and am linking it here, in the hope that it helps others struggling with a similar problem.

Data Wrangling of text extracted from PDF using PyMuPDF possible? (alternating colors for each row) - text positioned in the middle for each row
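For readers who want a starting point, here is a minimal sketch of what a PyMuPDF-based approach could look like. It is not the code from the linked post. It assumes you already know the vertical extent of each table row (for example, measured by hand, or derived from the shaded rectangles that `page.get_drawings()` reports) and then assigns every word to the row band that contains its vertical midpoint. The helper names `read_words` and `group_words_by_band`, and the band coordinates in the demo, are hypothetical.

```python
def read_words(pdf_path, page_number=0):
    """Return PyMuPDF word tuples: (x0, y0, x1, y1, text, block, line, word)."""
    import fitz  # PyMuPDF; imported lazily so the helper below works without it

    with fitz.open(pdf_path) as doc:
        return doc[page_number].get_text("words")


def group_words_by_band(words, bands):
    """Assign each word to the row band containing its vertical midpoint.

    words: iterable of tuples shaped like PyMuPDF's "words" output.
    bands: list of (top, bottom) y-ranges, one per table row (assumed known).
    Returns one list of word strings per band, in the order the words arrived.
    Words whose midpoint falls outside every band are dropped.
    """
    rows = [[] for _ in bands]
    for x0, y0, x1, y1, text, *_ in words:
        y_mid = (y0 + y1) / 2
        for i, (top, bottom) in enumerate(bands):
            if top <= y_mid < bottom:
                rows[i].append(text)
                break
    return rows


if __name__ == "__main__":
    # Synthetic words standing in for read_words("file.pdf") output.
    sample_words = [
        (10, 50, 40, 60, "4/1/2019", 0, 0, 0),
        (100, 45, 130, 55, "New", 1, 0, 0),
        (100, 56, 140, 66, "office", 1, 1, 0),
        (10, 80, 40, 90, "4/12/2019", 2, 0, 0),
    ]
    sample_bands = [(40, 70), (70, 100)]  # hypothetical grey and white rows
    for row in group_words_by_band(sample_words, sample_bands):
        print(" ".join(row))
```

Because rows are defined by their y-range rather than by ruling lines, a white row is treated exactly like a grey one, and text that wraps onto several lines inside one row stays together. Splitting each row into columns would need a similar pass over x-coordinates.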

score 0 · Answer 2 · answered May 17 '22 at 21:13

I'm not sure what's happening, but I confirmed that it works with the multiple_tables=False option, as follows:

In [41]: tabula.read_pdf(fname, pages=1, lattice=True, area = [43, 5, 568, 774], multiple_tables=False)
Out[41]:
[  Issued Date      Permit No.  ...                                       Proposed Use       Valuation
 0    4/1/2019  P025361-032119  ...  New office and restroom addition to existing\r...      $45,000.00
 1   4/12/2019  P025502-041219  ...  Isolate chapel from fire damaged area 4000 sq....       $1,000.00
 2   4/12/2019  P025487-041019  ...  Interior finish-out for new meat market 2500\r...      $35,000.00
 3   4/15/2019  P025520-041519  ...       New 8-unit apartment building 10,800 sq. ft.     $350,000.00
 4   4/25/2019  P025101-020719  ...                New Five Story Hotel 93,501 sq. ft.  $12,327,000.00
 5    4/9/2019  P025475-040919  ...                 Mobile Home Placement 1216 sq. ft.       $1,250.00
 6    4/9/2019  P025477-040919  ...                 Mobile Home Placement 1216 sq. ft.       $1,250.00
 7    4/9/2019  P025479-040919  ...                 Mobile Home Placement 1216 sq. ft.       $1,250.00
 8    4/8/2019  P025459-040519  ...                                   Build a carport.       $1,000.00

 [9 rows x 7 columns]]

It might cause another issue with pages="all", though.

Thank you for your reply, but you get the same output as I did. The PDF I attached has 18 rows in total, yet with the parameters you entered (the same as mine) only 9 rows are returned. It seems tabula reads only the rows with a grey background and ignores the rows with a white background. — Joe, May 19 '22 at 00:19
Ah, that's what I missed. I tried the [tabula app](https://tabula.technology/), which is the web app for tabula, and found that it doesn't extract the rows properly either. Unfortunately, it's a limitation of tabula-java itself. — chezou, May 19 '22 at 00:56
