I have a PDF containing text and several tables. One row looks like this:

PDF content :
Id: 5647484848 Name Alex J

I am parsing the content with tabula-py, but the result is incomplete: the first character of some cells is missing (the leading digit of the Id and the first letter of the name).

The original PDF contains a lot of text and tables. The same approach gives exactly the right result on other rows.

Wrong Result :
['', '', 'Id:', '', '647484848', 'Name', '', 'lex J', '', '', '']

Should be :
['', '', 'Id:', '', '5647484848', 'Name', '', 'Alex J', '', '', '']

Sample :

# Find the row that holds the name; index 7 is the Name value
if len(row) == 11:
    if "Name" in row:
        print(row[7])
        return Student(studentname=row[7])

When reading the tables with tabula, I use these settings:

df = tabula.read_pdf(pdf, output_format='json', pages='all',
                          password=secure_password, lattice=True)

The row is plain text, with no images. I don't know why extraction fails for this particular row; the same logic gives the correct result for other rows. Please suggest a fix.

Agustus
  • Can you copy/paste the offending line from your PDF into a Notepad text file? If not, the PDF might be broken for that data point. In a PDF, "what you see" is not "what you can extract", because visual glyphs and textual meaning are stored separately. It might look fine while the "textual representation" is garbage. If so, you would need to research OCR on PDFs and get the missing values from that. – Patrick Artner Dec 07 '19 at 09:17
  • Btw. bad problem for SO - no way to replicate, no way to solve. – Patrick Artner Dec 07 '19 at 09:18
  • @PatrickArtner Yes, I can copy it into a text editor and there is no issue: the original content shows up correctly there. Thanks – Agustus Dec 07 '19 at 09:22

1 Answer

Solved by changing the extraction mode in tabula-py from lattice=True to lattice=False:

df = tabula.read_pdf(pdf, output_format='json', pages='all',
                          password=secure_password, lattice=False)
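
Changing the extraction mode can also change how cells are split, so a hard-coded index such as row[7] may stop pointing at the name. Here is a minimal sketch of a label-based lookup instead. It assumes the JSON output is a list of tables, each with a "data" list of rows whose cells carry a "text" field; the helper names rows_from_tabula_json and value_after_label are hypothetical, and the sample input is hand-written rather than real tabula output.

```python
from typing import Any, Dict, List, Optional


def rows_from_tabula_json(tables: List[Dict[str, Any]]) -> List[List[str]]:
    """Flatten the assumed JSON shape into plain rows of stripped cell text."""
    rows = []
    for table in tables:
        for row in table.get("data", []):
            rows.append([cell.get("text", "").strip() for cell in row])
    return rows


def value_after_label(row: List[str], label: str) -> Optional[str]:
    """Return the first non-empty cell that follows `label`, or None."""
    if label not in row:
        return None
    start = row.index(label) + 1
    for cell in row[start:]:
        if cell:
            return cell
    return None


# Hand-written sample in the assumed JSON shape (not real tabula output).
cells = ["", "", "Id:", "", "5647484848", "Name", "", "Alex J", "", "", ""]
sample = [{"data": [[{"text": text} for text in cells]]}]

rows = rows_from_tabula_json(sample)
print(value_after_label(rows[0], "Name"))
```

If the label and the value end up merged into a single cell (for example "Name Alex J"), this lookup returns None and the cell would need to be split on the label instead.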
Agustus