
I found old code on a legacy server that writes a dataset of roughly 20 columns and >60K rows to an Excel file with openpyxl, item by item. It fills each cell on its own (the slow way, but not that slow either, since the whole thing finishes within a few minutes, including cell formatting):

cell = ws.cell('%s%s' % (col_idx, i + 3))  # old coordinate-string API; col_idx is a column letter, e.g. 'A3'
cell.value = unicode(value).encode("utf-8")  # Python 2 unicode-to-bytes handling
get_style(cell, "content", column_colors[col_idx])  # custom styling helper
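
For comparison, the same per-cell write with the current openpyxl API would look roughly like this (a sketch only; the 0-based column counter j is illustrative, and get_style and column_colors are the helpers from above):

cell = ws.cell(row=i + 3, column=j + 1)  # current API takes 1-based numeric indices
cell.value = value  # on Python 3, no unicode()/encode() dance is needed
get_style(cell, "content", column_colors[col_idx])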

At roughly 59K rows it crashes and the console prints:

Killed

The logs show:

Received SIGTERM, shutting down.

The SIGTERM hints at too little memory, so that the server kills the task "from outside". The server is a virtual machine. Checking memory with the free command shows that all of the free 3.5 GB gets used up by writing that small dataset (20 MB on the server) to Excel.

I checked it during the run and found: about 500 MB just to set up the workbook, then about 700 MB for every 10K rows. Writing the whole dataset therefore needs:

6 x 700 MB + 500 MB = 4700 MB of free RAM, while the VM has only 3.5 GB. That should still be plenty for such a small output: the final workbook is roughly 20 MB. The output Excel version should be 2007 or 2010 (xlsx), since it runs on Python 2.7.3 in outdated legacy code.
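
Those per-run numbers came from watching free externally. To reproduce them from inside the write loop, a minimal sketch using the standard-library resource module could look like this (the 10K-row cadence and names are illustrative, not from the original code):

import resource  # standard library; on Linux, ru_maxrss is reported in kilobytes

def log_peak_rss(row_idx):
    # print the process's peak resident set size so far
    peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
    print("row %d: peak RSS so far %.0f MB" % (row_idx, peak_mb))

# inside the write loop, e.g.:
# if i % 10000 == 0: log_peak_rss(i)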

Why does writing to a 20 MB workbook with Python's openpyxl module eat Gigabytes of RAM?

questionto42
  • The strange memory requirements are mentioned in the openpyxl documentation: about 50 times the size of the Excel file (https://openpyxl.readthedocs.io/en/stable/performance.html). Have you installed lxml? And do you use the write-only mode? – Robert Aug 24 '22 at 21:47
  • @Robert 50 times would be 1000 MB, not almost 5 GB. It is still clear then that it is just the module. I have not installed lxml and I cannot, since the server is outdated. I do not use write-only mode. I looked it up, `wb = Workbook(write_only=True)` at https://openpyxl.readthedocs.io/en/stable/optimized.html#write-only-mode, and it would save memory: `It is able to export unlimited amount of data (even more than Excel can handle actually), while keeping memory usage under 10Mb.` Will check. – questionto42 Aug 24 '22 at 23:08
  • Programming isn't on topic here, sorry. – Zac67 Aug 25 '22 at 05:38
  • @Zac67 I thought it would be a server problem, since I kept getting server memory crashes for that small sheet. – questionto42 Aug 25 '22 at 07:46
  • @questionto42standswithUkraine When your code crashes the server and the server's fine otherwise, the code is at fault. – Zac67 Aug 25 '22 at 08:30
  • @Zac67 Where should I move it? It is not a coding question either. I have set up the RAM of the VM now and that fixed it. I only needed to know what happens there. On Stack Overflow, I would have to give a code example, but it is not about the right coding. I have not changed a line of code in the end. – questionto42 Aug 25 '22 at 11:26
  • @Robert Could you make it an answer? Just that the module can take so much more memory? That is enough as an answer to the "Why" question at hand. I also tried the write_only mode, but it worked only for one of two workbooks, and that was the workbook without memory problems. – questionto42 Aug 25 '22 at 16:46

1 Answer


It is known that openpyxl requires a lot of RAM for reading/writing Excel files:

According to the openpyxl documentation on performance, the general RAM usage can be estimated as follows:

Memory use is fairly high in comparison with other libraries and applications and is approximately 50 times the original file size, e.g. 2.5 GB for a 50 MB Excel file.

In your case the value is even higher: 50 times a 20 MB workbook would be about 1000 MB, yet you observe close to 5 GB. Maybe your usage pattern differs, or the values from the documentation are outdated.

But the documentation also contains hints on how to optimize and thus reduce the RAM usage:

  1. Install the Python package lxml - if it is installed, openpyxl uses it automatically. It is especially recommended when writing "large" Excel files.

  2. Use write-only mode, sketched below.
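
A minimal sketch of write-only mode, based on the documentation's pattern (the file name and sample data are illustrative, not from your code):

from openpyxl import Workbook
from openpyxl.cell import WriteOnlyCell
from openpyxl.styles import Font

wb = Workbook(write_only=True)  # rows are streamed to disk instead of kept in RAM
ws = wb.create_sheet()          # a write-only workbook starts without sheets

for row in [["id", "name"], [1, "alpha"], [2, "beta"]]:  # sample data
    styled = []
    for value in row:
        cell = WriteOnlyCell(ws, value=value)  # per-cell styling is still possible
        cell.font = Font(bold=False)
        styled.append(cell)
    ws.append(styled)           # rows can only be appended, in order

wb.save("out.xlsx")             # the sheet cannot be modified after saving

Note that a write-only worksheet only supports append(); cells cannot be revisited, which may be why it did not fit every workbook (see the comments above).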

Robert