I have separate spreadsheets that contain data for each month of the year - 12 spreadsheets in total. Each workbook contains 200k-500k rows.
e.g.
January
| name | course | grade |
|-------|---------|-------|
| dave | math | 90 |
| chris | math | 80 |
| dave | english | 75 |
February
| name | course | grade |
|-------|---------|-------|
| dave | science | 72 |
| chris | art | 58 |
| dave | music | 62 |
I am using openpyxl to open each monthly workbook, iterate over each row and each cell and write the relevant data to the individuals workbook. i.e. all rows that belong to Chris go in "Chris.xlsx", al rows that belong to Dave go in "Dave.xlsx".
The issue that I am running into is that openpyxl is being extremely slow. I am sure it is due to my code being very procedural and not optimizing the iterations and the writing.
Any thoughts would be greatly appreciated.
def appendToWorkbooks():
print("Appending workbooks")
je_dump_path = "C:/test/"
# define list of files in path
je_dump_files = os.listdir( je_dump_path )
# define path for resultant files
results_path = "C:/test/output/"
max_row = 0
input_row = 1
for file in je_dump_files:
current_row = 1
# load each workbook in the directory
load_file = je_dump_path + file
print("Loading workbook: " + file)
wb = load_workbook(filename=load_file, read_only=True)
print("Loaded workbook: " + file)
# select the worksheet with the name Sheet in each workbook
ws = wb['Sheet']
print("Loaded worksheet")
# iterate through the rows in the currently open workbook
for row in ws.iter_rows():
# determine the person this row of data relates to
person = ws.cell(row=current_row, column=1).value
# set output workbook to that person
output_wb_file = results_path + person + ".xlsx"
output_wb = load_workbook(output_wb_file)
output_ws = output_wb["Sheet"]
# increment the current row
current_row = current_row + 1
print("Currently on row: " + str(current_row))
# determine the last row in the current output workbook
max_row = output_ws.max_row
# set the output row to the row after the last row in the current output workbook
output_row = max_row + 1
for cell in row:
output_ws.cell(row=output_row, column=column_index_from_string(cell.column)).value = cell.value
output_wb.save(output_wb_file)