
I am trying to extract all rows from the PDF attached here.

Here is the code I used:

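import pandas as pd
from tabula import read_pdf

def parse_latticepdf_pages(pdf):
    # Read every page in lattice mode, restricted to the given table area.
    pages = read_pdf(
        pdf,
        pages="all",
        guess=False,
        lattice=True,
        silent=True,
        area=[43, 5, 568, 774],
        pandas_options={"header": None},
    )
    # read_pdf returns a list of DataFrames; stack them into one.
    return pd.concat(pages)

parse_latticepdf_pages(pdf="file.pdf")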

The output contains only the rows with a grey background. Rows with a white background are missing. How do I get all rows regardless of their background color?

Note: Initially I tried stream = True, but that caused other problems: each line of text appears as a separate row, and it is impossible to group the rows as needed. Hence, I set lattice = True. Also, the result is the same with or without multiple_tables enabled.

I would appreciate any help regarding this. Thank you!

Joe
  • Not sure about that, but you can use the 'columns' parameter of tabula if the columns are fixed. That way the whole table will come out in one DataFrame. – Kathan Thakkar Apr 21 '22 at 05:21

2 Answers


I finally managed to solve this. For this particular PDF format, it is better to use another Python package such as PyMuPDF. I posted a similar question in another Stack Overflow post and am linking it here. I hope it helps others struggling with a similar problem.

Data Wrangling of text extracted from PDF using PyMuPDF possible? (alternating colors for each row) - text positioned in the middle for each row
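
For readers who want a starting point, here is a minimal sketch of the PyMuPDF route. It is not the code from the linked post. It assumes that page.get_text("words") returns tuples of the form (x0, y0, x1, y1, text, ...), and the helper names and the y_tolerance value are hypothetical. It only groups words into visual lines; merging several lines into one table row (as this PDF seems to need) would still have to be done on top of it.

```python
def group_words_into_lines(words, y_tolerance=3.0):
    """Group PyMuPDF-style word tuples into visual lines of text.

    Each word is (x0, y0, x1, y1, text, ...). A word joins the current
    line when its top edge is within y_tolerance points of the top edge
    of that line's first word (y_tolerance is an assumed value).
    """
    lines = []
    # Process words top to bottom, then left to right.
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(word[1] - lines[-1][0][1]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left to right and join their text.
    return [
        " ".join(w[4] for w in sorted(line, key=lambda w: w[0]))
        for line in lines
    ]


def extract_lines(pdf_path, y_tolerance=3.0):
    """Return the text lines of every page, ignoring background colors."""
    import fitz  # PyMuPDF; imported here so the helper above runs without it

    doc = fitz.open(pdf_path)
    try:
        all_lines = []
        for page in doc:
            words = page.get_text("words")
            all_lines.extend(group_words_into_lines(words, y_tolerance))
        return all_lines
    finally:
        doc.close()
```

Calling extract_lines("file.pdf") would return a flat list of strings, one per visual line. Because the words are read from the text layer, grey and white rows are treated alike.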

Joe

I'm not sure what's happening, but I confirmed that it works with the multiple_tables=False option, as follows:

In [41]: tabula.read_pdf(fname, pages=1, lattice=True, area = [43, 5, 568, 774], multiple_tables=False)
Out[41]:
[  Issued Date      Permit No.  ...                                       Proposed Use       Valuation
 0    4/1/2019  P025361-032119  ...  New office and restroom addition to existing\r...      $45,000.00
 1   4/12/2019  P025502-041219  ...  Isolate chapel from fire damaged area 4000 sq....       $1,000.00
 2   4/12/2019  P025487-041019  ...  Interior finish-out for new meat market 2500\r...      $35,000.00
 3   4/15/2019  P025520-041519  ...       New 8-unit apartment building 10,800 sq. ft.     $350,000.00
 4   4/25/2019  P025101-020719  ...                New Five Story Hotel 93,501 sq. ft.  $12,327,000.00
 5    4/9/2019  P025475-040919  ...                 Mobile Home Placement 1216 sq. ft.       $1,250.00
 6    4/9/2019  P025477-040919  ...                 Mobile Home Placement 1216 sq. ft.       $1,250.00
 7    4/9/2019  P025479-040919  ...                 Mobile Home Placement 1216 sq. ft.       $1,250.00
 8    4/8/2019  P025459-040519  ...                                   Build a carport.       $1,000.00

 [9 rows x 7 columns]]

It might cause another issue with pages="all", though.

chezou
  • Thank you for your reply, but you get the same output as I did. The PDF I attached has a total of 18 rows, yet with the parameters you used (the same as mine) only 9 rows are returned. It seems tabula reads only the rows with a grey background and ignores the rows with a white background. – Joe May 19 '22 at 00:19
  • Ah, that's what I missed. I tried it with the [tabula app](https://tabula.technology/), the web app for Tabula, and found that it doesn't extract the rows properly either. Unfortunately, it's a limitation of tabula-java itself. – chezou May 19 '22 at 00:56