How to extract a single row table data from a pdf using python?

Question

I need to extract tabular data from PDFs. Some tables in the PDF consist of only a single row. I have been trying to extract the data using the Camelot library.

Code for extraction using Camelot:

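# Install first (run in a shell): pip install "camelot-py[cv]" tabula-py
import camelot

file = 'xyz.pdf'
# Parse every page; the default flavor is "lattice" (ruled tables)
tables = camelot.read_pdf(file, pages="all")
# Inspect the seventh detected table as a pandas DataFrame
tables[6].df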

The above code fails to extract tables that have only a single row.

For instance, in the PDF https://www.nirfindia.org/nirfpdfcdn/2022/pdf/Engineering/IR-E-U-0306.pdf, Camelot does not detect the last table (under the heading "Faculty Details"), which consists of only one row.

Can someone suggest a workaround?

Try to reformulate your question according to [here](https://stackoverflow.com/help/minimal-reproducible-example) — Molitoris, Nov 22 '22 at 17:19

score 0 · Accepted Answer · answered Nov 24 '22 at 09:47

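According to the Camelot docs, if you want to detect smaller lines, you should increase the line_scale parameter (default: 15). Larger values let the lattice parser pick up shorter ruling lines, although a very large value can cause text to be detected as lines.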

In your case, this command works fine:

tables = camelot.read_pdf(file, pages="all", line_scale=80)
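If 80 is not the right value for a different PDF, one way to choose is to compare how many tables Camelot finds at several line_scale settings. The following is only a rough sketch: the helper name count_tables_by_line_scale, the candidate scales, and the injectable reader argument are illustrative assumptions, not part of Camelot. It relies only on camelot.read_pdf accepting pages, flavor and line_scale, and on the returned table list supporting len().

```python
def count_tables_by_line_scale(path, scales=(15, 40, 80), pages="all", reader=None):
    """Return {line_scale: number of tables detected} for each candidate scale.

    `reader` is a hypothetical hook for substituting the PDF reader
    (e.g. in tests); by default camelot.read_pdf is used.
    """
    if reader is None:
        import camelot  # imported lazily so the helper can be defined without it
        reader = camelot.read_pdf

    counts = {}
    for scale in scales:
        # line_scale only applies to the lattice flavor (Camelot's default)
        tables = reader(path, pages=pages, flavor="lattice", line_scale=scale)
        counts[scale] = len(tables)
    return counts
```

Calling count_tables_by_line_scale('xyz.pdf') and looking for the smallest scale at which the table count stops growing is a reasonable starting point; it is still worth checking the extracted DataFrames by eye, since a high count can also come from text being mistaken for lines.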

Stefano Fiorucci - anakin87