0

I have a code where I am using tabula-py to read tables from pdf and then write the resulting list of dataframes to a single excel with each dataframe in a separate sheet.

Here is my current code:

def read_pdf(pdf_file):
    
    output_filepath = "output.xlsx"
    dfs = tabula.read_pdf(pdf_file, pages='all')
    for i in range(len(dfs)):
        print(dfs[i].to_string())
        with ExcelWriter(output_filepath) as writer:
            dfs[i].to_excel(writer, sheet_name='sheet%s' % i)

With the print function I can see dataframes with values, but the resulting excel is empty with just one sheet and no output in it.

user2966197
  • 2,793
  • 10
  • 45
  • 77

2 Answers2

1

You are opening the ExcelWriter object multiple times within the loop, which may result in overwriting the previous data. You should move the with statement outside of the loop so that you open the ExcelWriter object only once. Somethink like that:

def read_pdf(pdf_file):
    
    output_filepath = "output.xlsx"
    dfs = tabula.read_pdf(pdf_file, pages='all')
    writer = pd.ExcelWriter(output_filepath, engine='xlsxwriter')
    for i in range(len(dfs)):
        print(dfs[i].to_string())
        dfs[i].to_excel(writer, sheet_name='sheet%s' % i)
    writer.save()

Ive used the engine= 'xlswriter' cause of the better performance and it supports more advanced features than the default engine. At the end ive added 'writer save' outside the loop to save the Excel file after writing. If you dont need it, you can just delete it.

Chewbacca
  • 139
  • 6
1

You need to reverse the for-loop <=> context-manager :

def read_pdf(pdf_file):
    
    output_filepath = "output.xlsx"
    dfs = tabula.read_pdf(pdf_file, pages="all")

    with ExcelWriter(output_filepath) as writer:
        for idx, df in enumerate(dfs, start=1)
            df.to_excel(writer, sheet_name=f"Sheet_{idx}")
Timeless
  • 22,580
  • 4
  • 12
  • 30