I want to extract table data from a PDF into a structured format such as HTML, XML, or JSON. I am using Python. I first convert the PDF to text with the pdftotext command-line tool, but in the resulting text I cannot tell which data belongs to a table.

A screenshot of the PDF is shown below:

[image: screenshot of the PDF]

Venkatesh Wadawadagi

1 Answer

You can use Camelot to extract tabular data from PDFs and export it to CSV, Excel, JSON, or HTML. The documentation is at http://camelot-py.readthedocs.io. It would help if you could post a link to your PDF. Here's a generic code example:

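>>> import camelot
>>> # Parse the PDF; only the first page is read unless `pages` is given
>>> tables = camelot.read_pdf('file.pdf')
>>> # Each detected table exposes its contents as a pandas DataFrame
>>> type(tables[0].df)
<class 'pandas.core.frame.DataFrame'>
>>> # Write the tables to disk; f can also be 'json', 'html' or 'excel'
>>> tables.export('file.csv', f='csv')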
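Since the question also mentions XML, which is not among the export formats listed above, here is a rough sketch of turning extracted rows into JSON and XML with only the standard library. It assumes the rows have already been pulled out of a table (for example with `tables[0].df.values.tolist()`) and that the first row holds the column headers. The sample data and the helper names `rows_to_records` and `records_to_xml` are made up for illustration and are not part of Camelot.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample rows standing in for tables[0].df.values.tolist();
# the first row is assumed to hold the column headers.
rows = [
    ["Name", "Qty"],
    ["Widget", "4"],
    ["Gadget", "7"],
]


def rows_to_records(rows):
    """Pair each data row with the header row to get a list of dicts."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def records_to_xml(records):
    """Build <table><row><cell name="...">value</cell></row></table>."""
    root = ET.Element("table")
    for record in records:
        row_el = ET.SubElement(root, "row")
        for key, value in record.items():
            # Keep the header as an attribute so arbitrary header text
            # does not have to be a valid XML tag name.
            cell = ET.SubElement(row_el, "cell", name=key)
            cell.text = value
    return ET.tostring(root, encoding="unicode")


records = rows_to_records(rows)
print(json.dumps(records, indent=2))
print(records_to_xml(records))
```

Storing the header text in a `name` attribute rather than using it as the element name is a deliberate choice here: table headers often contain spaces or punctuation that are not allowed in XML tag names.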

Disclaimer: I'm the author of the library.

Vinayak Mehta
  • Hi Vinayak, This looks interesting. Are you also looking at the underlying graphics of the PDF, such as alternate shades of rows and grid lines? Thank you. – Sau001 Mar 08 '19 at 07:40