
I am extracting tables from a PDF using tabula-py. It extracts all the rows, but it does not split them into columns correctly. I am using the sample PDF below.

[image 1: sample PDF]

I tried the extraction with the code below:

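import tabula
import pandas as pd

path = "/GST_OCR input Pdfs/gst3.pdf"

# Read every column as a string so values are not converted to numbers
col2str = {"dtype": str}
kwargs = {
    "multiple_tables": True,
    "pandas_options": col2str,
    "lattice": False,
    "guess": False,
}

# With multiple_tables=True, read_pdf returns a list of DataFrames
csv_data = tabula.read_pdf(path, pages="all", **kwargs)

# Optional: write each table to its own sheet (xlsxwriter needs an .xlsx file name)
# with pd.ExcelWriter(csv_data[1].iloc[0, 1] + ".xlsx", engine="xlsxwriter") as writer:
#     for i in range(len(csv_data)):
#         csv_data[i].to_excel(writer, sheet_name=f"Sheet {i+1}")

# Inspect the sixth extracted table
csv_data[5]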

It does not extract the rows properly; instead, it creates unnamed columns. The extracted output looks like this:

[image 2: extracted output]
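For reference, here is a minimal pandas-only sketch of one possible workaround: folding each "Unnamed: N" column back into the nearest named column on its left. It assumes the unnamed columns only hold spill-over text from that neighbouring column, which may not be true for this PDF. The helper name `merge_unnamed_columns` and the sample values are hypothetical, not part of tabula-py or of my real data.

```python
import pandas as pd


def merge_unnamed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Fold each 'Unnamed: N' column into the nearest named column to its left.

    Hypothetical helper: assumes unnamed columns hold spill-over text only.
    """
    merged = {}
    current = None  # name of the most recent named column
    for col in df.columns:
        # Treat missing cells as empty strings so they can be concatenated.
        values = df[col].fillna("").astype(str).str.strip()
        if str(col).startswith("Unnamed") and current is not None:
            # Append the spill-over text to the previous named column.
            merged[current] = (merged[current] + " " + values).str.strip()
        else:
            current = col
            merged[current] = values
    return pd.DataFrame(merged, index=df.index)


# Made-up frame shaped like the broken output (not real data from the PDF).
sample = pd.DataFrame(
    {
        "Description": ["Item A", "Item B"],
        "Unnamed: 1": ["(large)", None],
        "Amount": ["100.00", "250.50"],
    }
)
cleaned = merge_unnamed_columns(sample)
print(cleaned)
```

This only cleans up the result after extraction. It does not change how tabula-py detects the column boundaries in the first place.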

Please help me with this. Thanks in advance.

Nag Arjun

0 Answers