0

I have separate spreadsheets that contain data for each month of the year - 12 spreadsheets in total. Each workbook contains 200k-500k rows.

e.g.

January

| name  | course  | grade |
|-------|---------|-------|
| dave  | math    | 90    |
| chris | math    | 80    |
| dave  | english | 75    |

February

| name  | course  | grade |
|-------|---------|-------|
| dave  | science | 72    |
| chris | art     | 58    |
| dave  | music   | 62    |

I am using openpyxl to open each monthly workbook, iterate over each row and each cell and write the relevant data to the individuals workbook. i.e. all rows that belong to Chris go in "Chris.xlsx", al rows that belong to Dave go in "Dave.xlsx".

The issue that I am running into is that openpyxl is being extremely slow. I am sure it is due to my code being very procedural and not optimizing the iterations and the writing.

Any thoughts would be greatly appreciated.

def appendToWorkbooks():
    print("Appending workbooks")
    je_dump_path = "C:/test/"

    # define list of files in path
    je_dump_files = os.listdir( je_dump_path )

    # define path for resultant files
    results_path = "C:/test/output/"

    max_row = 0
    input_row = 1

    for file in je_dump_files:
        current_row = 1

        # load each workbook in the directory
        load_file = je_dump_path + file
        print("Loading workbook: " + file)
        wb = load_workbook(filename=load_file, read_only=True)
        print("Loaded workbook: " + file)

        # select the worksheet with the name Sheet in each workbook
        ws = wb['Sheet']
        print("Loaded worksheet")

        # iterate through the rows in the currently open workbook
        for row in ws.iter_rows():

            # determine the person this row of data relates to
            person = ws.cell(row=current_row, column=1).value

            # set output workbook to that person
            output_wb_file = results_path + person + ".xlsx"
            output_wb = load_workbook(output_wb_file)
            output_ws = output_wb["Sheet"]

            # increment the current row
            current_row = current_row + 1

            print("Currently on row: " + str(current_row))

            # determine the last row in the current output workbook
            max_row = output_ws.max_row

            # set the output row to the row after the last row in the current output workbook
            output_row = max_row + 1

            for cell in row:
                output_ws.cell(row=output_row, column=column_index_from_string(cell.column)).value = cell.value
            output_wb.save(output_wb_file)
Panda
  • 6,955
  • 6
  • 40
  • 55
stutch
  • 33
  • 6
  • @mike-müller saw you had assisted with a similar post on http://stackoverflow.com/questions/35823835/reading-excel-file-is-magnitudes-slower-using-openpyxl-compared-to-xlrd – stutch Dec 21 '16 at 01:50

1 Answers1

0

This line is very expensive to have inside a loop: max_row = output_ws.max_row

But you really need to provide more details about your files and the performance you're seeing. How big are the individual files? How long do they take to load individually? Etc.

Charlie Clark
  • 18,477
  • 4
  • 49
  • 55
  • thanks for the reply. The individual files range from 41MB to 150MB (c. 200k - 900k) rows. It takes about 30 seconds to go through 82 rows. Assuming 200k rows over 12 workbooks, this will take about 10 days to complete. – stutch Dec 21 '16 at 14:53
  • Do not call `ws.cell` within the `ws.iter_rows` loop, especially in read-only mode. It will cause openpyxl to start parsing the file again. You only need to loop through the cells in each row. Don't increment rows yourself, use enumerate to get a counter. Please read the documentation. – Charlie Clark Dec 21 '16 at 21:02
  • Thanks Charlie. I have tried to read the docs. Wouldn't using write_only mode be faster to write the entire row to the new spreadsheet? Though I am not sure how to achieve this. Your help would be much appreciated. – stutch Dec 21 '16 at 22:04
  • To be honest this kind of question would be better on the mailing list. There's a lot wrong with your code and SO isn't the best platform for this kind of discussion. – Charlie Clark Dec 22 '16 at 09:51
  • Thanks Charlie - will pop it on the mailing list. – stutch Dec 22 '16 at 15:47