Extracting data from a table in a PDF to a structured format

Question

I want to extract table data from a PDF into a structured format such as HTML, XML, or JSON. I am using Python. I first convert the PDF to text with the pdftotext command-line tool, but I cannot tell which parts of the resulting text are table data.

The pdf image is as shown below:

Vinayak Mehta · Answer 1 · 2018-11-09T18:50:23.210

You can use Camelot to extract tabular data from PDFs and export it to CSV, Excel, JSON or HTML. You can check out the documentation at: http://camelot-py.readthedocs.io. It would be helpful if you could post a link to your PDF. Here's a generic code example:

>>> import camelot
>>> tables = camelot.read_pdf('file.pdf')
>>> type(tables[0].df)
<class 'pandas.core.frame.DataFrame'>
>>> tables.export('file.csv', f='csv')
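
Since the question asks for JSON or HTML, here is a minimal sketch of that last step using only the standard library. It assumes the first row of the extracted table is its header, and that the rows are available as a list of lists (with Camelot this could be something like `tables[0].df.values.tolist()`, which is an assumption about your table, not something verified against your PDF). The helper names `rows_to_records` and `rows_to_html` and the sample rows are hypothetical.

```python
import json
from html import escape


def rows_to_records(rows):
    """Treat the first row as the header and map each later row to a dict."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def rows_to_html(rows):
    """Render the rows as a plain HTML table, escaping every cell value."""
    header, *body = rows
    head = "".join(f"<th>{escape(str(cell))}</th>" for cell in header)
    lines = ["<table>", f"  <tr>{head}</tr>"]
    for row in body:
        cells = "".join(f"<td>{escape(str(cell))}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Hypothetical sample rows standing in for an extracted table.
# Assumption: with Camelot they could come from tables[0].df.values.tolist()
rows = [["Name", "Qty"], ["Bolt", "4"], ["Nut & washer", "12"]]

records = rows_to_records(rows)
json_text = json.dumps(records, indent=2)
html_text = rows_to_html(rows)

print(json_text)
print(html_text)
```

If the table has no header row, or the header spans several rows, the first-row assumption does not hold and the column names would need to be supplied by hand.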

Disclaimer: I'm the author of the library.

edited Nov 09 '18 at 18:50

answered Nov 09 '18 at 18:33

Hi Vinayak, This looks interesting. Are you also looking at the underlying graphics of the PDF, such as alternate shades of rows and grid lines? Thank you. – Sau001 Mar 08 '19 at 07:40
