I am trying to extract a table like this into a DataFrame. How can I do that with Python, including the names that are split across several lines?

I also want the solution to be general, so that it can be applied to any table (even one that doesn't have this structure). Giving the coordinates of each separate table therefore won't work well.

[image: example table]

stephsmith

1 Answer

I don't know about your exact problem, but if you want to extract data or tables from a PDF, try the camelot-py library. It is easy to use and, in my experience, gives better than 90% accuracy. I am working on a similar project.

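import camelot

# PDF_file_Path holds the path to your PDF; table_areas is "x1,y1,x2,y2" in PDF coordinates
tables = camelot.read_pdf(PDF_file_Path, flavor='stream', pages='1', table_areas=['5,530,620,180'])
print(tables[0].parsing_report)  # accuracy, whitespace, order, page
df = tables[0].df  # the first table as a pandas DataFrame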

The parameters of camelot.read_pdf are:

  • PDF_file_Path: the path to the PDF file;
  • table_areas: optional. Provide the coordinates if you want one exact table; otherwise camelot extracts every table it finds on the page;
  • pages: the pages to parse, given as a string (e.g. '1', '1,3,4' or 'all').

.parsing_report shows a description of the result, e.g., accuracy and whitespace.
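Since table_areas is optional, one way to keep the approach general is to parse every page without coordinates and then keep only the tables whose parsing report looks trustworthy. The sketch below is only an illustration: the helper names extract_all_tables and keep_confident are made up for this example, and the 90.0 threshold is an arbitrary assumption, not a camelot default.

```python
def extract_all_tables(pdf_path, flavor="stream"):
    """Parse every page of pdf_path without table_areas (needs camelot-py)."""
    import camelot  # imported lazily so keep_confident works without camelot

    return camelot.read_pdf(pdf_path, flavor=flavor, pages="all")


def keep_confident(tables, min_accuracy=90.0):
    """Keep tables whose parsing_report accuracy is at least min_accuracy.

    Works on any objects exposing a dict-like .parsing_report, which is
    what camelot tables provide. The threshold is an assumed value.
    """
    return [
        t for t in tables
        if t.parsing_report.get("accuracy", 0.0) >= min_accuracy
    ]


# Hypothetical usage ("report.pdf" is a placeholder path):
# tables = keep_confident(extract_all_tables("report.pdf"))
# frames = [t.df for t in tables]
```

Tables that fall below the threshold are worth inspecting by hand, since a low accuracy often means the wrong flavor was used for that page.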

.df returns the table as a DataFrame. Index 0 refers to the first table; how many tables you get depends on your data.
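For names that are split across several lines, the extracted table often contains extra rows that hold only the continuation of the name. A minimal sketch of a post-processing step follows, assuming that a continuation row has text in the name column and nothing in any other column, and that it comes after the row it belongs to. The function merge_wrapped_rows is a hypothetical helper, not part of camelot, and it works on plain lists of rows so it does not depend on how the table was extracted.

```python
def merge_wrapped_rows(rows, key_col=0, sep=" "):
    """Merge rows that only continue a wrapped cell into the row above.

    Assumption: a continuation row has text in key_col and every other
    cell empty. Tables with a single column are returned unmerged,
    because the rule cannot tell continuations apart there.
    """
    merged = []
    for row in rows:
        cells = [str(c).strip() for c in row]
        others_empty = all(c == "" for i, c in enumerate(cells) if i != key_col)
        is_continuation = (
            len(cells) > 1 and bool(merged) and cells[key_col] != "" and others_empty
        )
        if is_continuation:
            # Append the continuation text to the previous row's key cell.
            merged[-1][key_col] = merged[-1][key_col] + sep + cells[key_col]
        else:
            merged.append(cells)
    return merged


rows = [["John", "12", "NY"], ["Smith", "", ""], ["Ann Lee", "7", "LA"]]
print(merge_wrapped_rows(rows))
```

With a camelot table this could be applied as merge_wrapped_rows(df.values.tolist()) and the result passed back to pandas.DataFrame. If the wrapped part of the name appears above the data row instead of below it, the rule would need to be reversed.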

You can read more about them in the camelot documentation.

Edward Ji