I want to know how to extract a particular table column from a PDF file in Python.

My code so far

import tabula.io as tb

dfs = tb.read_pdf(pdf_path, pages='all')
print(len(dfs))  # displays 73

I can access an individual table column with print(dfs[2]['Section ID']). How can I search for a particular column across all the data frames using a for loop?

I want to do something like this

for i in range(len(dfs)):
    if (dfs[i][2]) == 'Section ID '  # (this gives invalid syntax)
    print dfs[i]
user1107731

1 Answer

If only one dataframe has a Section ID column (or you are interested only in the first dataframe that has it), you can iterate over the list returned by read_pdf, check for the column with in df.columns, and break when a match is found.

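import tabula.io as tb

df_list = tb.read_pdf(pdf_path, pages='all')

# Stop at the first dataframe that has a 'Section ID' column.
for df in df_list:
    if 'Section ID' in df.columns:
        print(df)
        break
else:
    # The loop finished without a break, so no dataframe has the column.
    print("No table with a 'Section ID' column was found")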
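
The same first-match search can also be written as a small helper built on next() with a default value, which makes the "nothing found" case explicit. This is only a sketch: first_with_column is a hypothetical name, and the hand-built dataframes stand in for whatever read_pdf returns for your PDF.

```python
import pandas as pd

# Hypothetical stand-ins for the list of dataframes returned by read_pdf.
df_list = [
    pd.DataFrame({"Name": ["a", "b"]}),
    pd.DataFrame({"Section ID": [101, 102], "Title": ["Intro", "Scope"]}),
    pd.DataFrame({"Section ID": [201]}),
]


def first_with_column(frames, column):
    """Return the first dataframe that has `column`, or None if none does."""
    return next((df for df in frames if column in df.columns), None)


match = first_with_column(df_list, "Section ID")   # second dataframe in the list
missing = first_with_column(df_list, "Page")       # None, no such column
```

Checking the result against None before using it avoids accidentally working with an unrelated dataframe when the column is absent.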

If several dataframes may have the Section ID column, you can filter with a list comprehension to get a list of all the dataframes that have that column name.

dfs_section_id = [df for df in df_list if 'Section ID' in df.columns]
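
If what you want in the end is the column itself rather than the whole dataframes, one possible follow-up is to pull it out of every matching table and concatenate the pieces. The sketch below is an assumption about what you need, not something tabula does for you: collect_column is a hypothetical helper, the dataframes are made up, and the header stripping is there because extracted headers sometimes carry stray whitespace (as in 'Section ID ' in the question).

```python
import pandas as pd

# Hypothetical stand-ins for the list of dataframes returned by read_pdf.
df_list = [
    pd.DataFrame({"Name": ["a", "b"]}),
    pd.DataFrame({"Section ID ": [101, 102]}),  # header with a trailing space
    pd.DataFrame({"Section ID": [201], "Title": ["Scope"]}),
]


def collect_column(frames, column):
    """Concatenate `column` from every dataframe that has it."""
    parts = []
    for df in frames:
        # Strip whitespace from headers on a renamed copy; originals stay untouched.
        cleaned = df.rename(columns=lambda c: str(c).strip())
        if column in cleaned.columns:
            parts.append(cleaned[column])
    if not parts:
        # No table had the column: return an empty Series instead of failing.
        return pd.Series(name=column, dtype="object")
    return pd.concat(parts, ignore_index=True)


section_ids = collect_column(df_list, "Section ID")
print(section_ids.tolist())
```

With ignore_index=True the result gets a fresh 0..n-1 index, so rows from different tables do not share index labels.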
n1colas.m