-2

I am refactoring a small project which bases on openpyxl to analyze .xlsx inputs. I am trying to reduce time needed for it to do its job and currently trying methods for accessing cell values of Excel worksheet.

I tested two methods (code below) with timeit and one of them seems to be twice as fast as first one.

import timeit
from openpyxl.reader.excel import ExcelReader

samplesPath = r'path_to_excel_file'

workbook = ExcelReader(samplesPath)
workbook.read()

worksheet = workbook.wb['Sheet1']

def func1():
    for i in range(1,10):
        worksheet.cell(i,1).value = 'value'

def func2():
    for i in range(1,10):
        workbook.wb['Sheet1'].cell(i,1).value = 'value'

print(timeit.timeit(func1))
print(timeit.timeit(func2))

Timeit scores:

func1 = 29.23 func2 = 51.07

Can someone explain me the difference between saving worksheet as variable, and then accessing the cell value with it, and accessing it when calling a worksheet every time?

How does it vary in the background, and if there is even faster method?

w8eight
  • 605
  • 1
  • 6
  • 21

1 Answers1

1

This question is not about openpyxl per-se. Yes, you are using openpyxl, but your question is more general and could be applied to many other scenarios.

Now, you are not really using two methods for accessing cells. You are using exactly one, the only difference is how you access the sheet.

In func1 you are using the worksheet object and accessing its cells.

In func2 you are adding another action, which is also accessing the sheet on each iteration. Doing workbook.wb['Sheet1'] means accessing the sheets pool of the workbook and getting the one with name 'Sheet1'. You are doing that on every iteration, so of course that will take more time than using the pre-fetched sheet.


Apart from that, that does seem like a lot of time and part of it might be with how you open the file. Seems like you're overly complexing it. Try doing:

from openpyxl import load_workbook
workbook = load_workbook(path)
worksheet = workbook['Sheet1']
Tomerikoo
  • 18,379
  • 16
  • 47
  • 61
  • Actually, there should be no difference between them as worksheets are loaded into memory when a workbook is read. However, the user is deliberately using openpyxl internals for the calls. – Charlie Clark Aug 20 '19 at 13:58
  • @CharlieClark not sure I understand. Would you mind elaborating? – Tomerikoo Aug 20 '19 at 15:39
  • `load_workbook()` will read the workbooks and every child object so that accessing sheets from workbook is just a dictionary lookup. For whatever reason the poster is using internal code from the library, which is why his workbook is `wookbook.wb` but the calls are the same and do not explain the discrepancies he's claiming. – Charlie Clark Aug 21 '19 at 08:43
  • @CharlieClark I agree the getting the sheet shouldn't take much time, it is just a loop on the sheets that shouldn't be bigger than, say 10. But, it is still added computation. What do you suggest is the cause for such big difference? – Tomerikoo Aug 21 '19 at 09:30
  • GIGO – the docs contain much better comparative tests and hotspots have been profiled. `timeit()` is the wrong tool for what is essentially an integration. – Charlie Clark Aug 21 '19 at 10:58