Unable to extract tables with Tabula or Camelot

Question

I tried to extract the table below using Tabula, but it returned an empty DataFrame. The same code works fine for other, similar tables.

I tried Camelot as well, but that didn't work either. Any suggestions on how I can extract these?

Here is my code:

from tabula import read_pdf
import pandas as pd

Page_No = 1
# read_pdf returns a list of DataFrames, one per detected table
tables = read_pdf('/content/page1.pdf', pages=Page_No, multiple_tables=True)
df1 = pd.DataFrame(tables[0])
df1

import camelot

tables2=camelot.read_pdf('page1.pdf', flavor='lattice', pages='1')
tables2

As the Camelot docs (https://camelot-py.readthedocs.io/en/master/user/how-it-works.html) explain, you should try `flavor='stream'`, since your table has no ruling lines separating its cells — Stefano Fiorucci - anakin87, Nov 14 '22 at 11:39
It works after adding flavor='stream'. Thanks @StefanoFiorucci-anakin87 — Pravin, Nov 14 '22 at 13:51
@StefanoFiorucci-anakin87, it worked at first, but the next time it threw a zero division error for the same table. Any clues? — Pravin, Nov 14 '22 at 14:16
It is a known issue: https://github.com/camelot-dev/camelot/issues/299 You can try applying the workaround suggested in the link... — Stefano Fiorucci - anakin87, Nov 14 '22 at 14:50
But it was working before; after restarting the runtime, it stopped working for the same files — Pravin, Nov 14 '22 at 14:55

Pravin · Accepted Answer · 2022-11-14T16:48:57.357

The issue was fixed by using flavor='stream' in Camelot and by passing stream=True and guess=False to read_pdf in tabula.

from tabula import read_pdf
import pandas as pd

Page_No = 1
# stream=True: detect columns from whitespace instead of ruling lines
# guess=False: do not let tabula guess the table area on the page
tables = read_pdf('/content/page1.pdf', pages=Page_No, guess=False, stream=True)
df1 = pd.DataFrame(tables[0])
df1
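
Since the same call works for some tables and returns nothing for others, it may help to try the option sets from this thread in order and keep the first one that yields a non-empty table. The following is only a sketch under assumptions: the helper names `first_nonempty_tables` and `extract_with_tabula` are hypothetical (not part of tabula), and "non-empty" is taken to mean a table with at least one row.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple

# Option sets taken from this thread: the original call, then the one that worked.
TABULA_OPTION_SETS: List[Dict[str, Any]] = [
    {"pages": 1, "multiple_tables": True},
    {"pages": 1, "guess": False, "stream": True},
]


def first_nonempty_tables(
    reader: Callable[..., Sequence[Any]],
    path: str,
    option_sets: Sequence[Dict[str, Any]],
) -> Tuple[Optional[Dict[str, Any]], List[Any]]:
    """Hypothetical helper: call reader(path, **options) for each option set
    and return (options, tables) for the first set that yields at least one
    table with one or more rows. Returns (None, []) if every attempt is empty."""
    for options in option_sets:
        tables = reader(path, **options)
        # len() of a DataFrame is its row count, so empty frames are dropped.
        nonempty = [t for t in tables if len(t) > 0]
        if nonempty:
            return options, nonempty
    return None, []


def extract_with_tabula(path: str) -> Tuple[Optional[Dict[str, Any]], List[Any]]:
    """Hypothetical wrapper; imports tabula lazily so the helper above
    can be exercised without tabula installed."""
    from tabula import read_pdf

    return first_nonempty_tables(read_pdf, path, TABULA_OPTION_SETS)
```

Calling `extract_with_tabula('/content/page1.pdf')` would return the options that produced a result together with the list of DataFrames, which also avoids indexing `tables[0]` when nothing was detected.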

edited Nov 14 '22 at 16:48

answered Nov 14 '22 at 13:53
