
I am successfully writing dataframes to Excel using df.to_excel(). Unfortunately, this is slow and consumes gobs of memory. The larger the dataset, the more memory it consumes, until (with the largest datasets I need to deal with) the server starves for resources.

I found that using the df.to_csv() method instead offers the convenient chunksize=nnnn argument. This is far faster and consumes almost no extra memory. Wonderful! I'll just write initially to .csv, then convert the .csv to .xlsx in another process. I tried doing it with openpyxl, but every solution I found iterated through the csv data one row at a time, appending to a Workbook sheet, e.g.

import csv
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
with open(temporary_filepath, newline='') as f:
    for row in csv.reader(f):
        ws.append(row)
wb.save(output_filepath)

This works, but when I watch my resource monitor, it consumes just as much memory and is just as slow (I now assume the original df.to_excel() was doing the same thing internally). So this approach didn't get me out of the woods after all.

I had assumed I could do something that would be the equivalent of File | Save As, but in Python, e.g. read the whole csv into an openpyxl Workbook and save it to a file all in one go, without iterating, but either this is not possible or I can't find the documentation on it.

Given a very large Pandas dataframe and a requirement to output .xlsx (not .csv), what is the best approach for low memory consumption? Can it be done efficiently with Pandas or Openpyxl, or is there a better tool for the job?

Update: Looks like pyexcel has a Save As method that might do the trick. Would prefer not to add yet another spreadsheet lib to the stack if possible, but will do if there is no equivalent in pandas or openpyxl. Has anyone used that successfully?

shacker

2 Answers


You could probably use the pyexcelerate library: https://github.com/kz26/PyExcelerate. They have posted benchmarks on their GitHub repo.


from pyexcelerate import Workbook

# convert the frame to a list of rows, header row first
values = [list(df.columns)] + df.values.tolist()
wb = Workbook()
wb.new_sheet('data_sheet_name', data=values)
wb.save('data.xlsx')
Bogdan Doicin
bigbounty
  • Thanks @bigbounty, this looks like exactly what we were looking for. – shacker Jul 21 '20 at 01:05
  • The linked benchmark is very much out of date and pyexcelerate is very limited in its support of OOXML. openpyxl has its own more detailed benchmarking: https://openpyxl.readthedocs.io/en/latest/performance.html At the end of the day, there's a trade off between functionality and performance. – Charlie Clark Jul 21 '20 at 08:17
  • @CharlieClark Link removed. – Bogdan Doicin Nov 15 '21 at 11:13

The pyexcelerate response is exactly what I asked about, so I accepted that answer, but just wanted to post an update that we found an alternate solution that's possibly even easier. Sharing here in case it's useful.

Pandas now prefers xlsxwriter over openpyxl. If it's installed and you do not specify the engine, xlsxwriter will be used by default (or of course you can specify it explicitly). In my experiments, xlsxwriter was 4x more memory-efficient than openpyxl at the task of writing to Excel. This is not an infinitely scalable solution - it's still conceivable that one could receive a dataset so large that it overwhelms memory even with this optimization - but it's extremely easy: just pip install xlsxwriter and df.to_excel() uses roughly 4x less memory, with no code changes (in my case).
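For anyone who needs to squeeze memory further: xlsxwriter also has a constant_memory option that flushes each row to disk as soon as it is written. Because that mode requires cells to be written in strict row order, which df.to_excel() does not guarantee, this sketch uses xlsxwriter directly rather than through pandas; the file name is a placeholder:

```python
import pandas as pd
import xlsxwriter

df = pd.DataFrame({'a': range(5), 'b': range(5)})

# constant_memory flushes each completed row to disk, keeping
# memory use flat regardless of row count; rows must be written
# top to bottom, never revisited
wb = xlsxwriter.Workbook('data.xlsx', {'constant_memory': True})
ws = wb.add_worksheet()
ws.write_row(0, 0, list(df.columns))
for i, row in enumerate(df.itertuples(index=False), start=1):
    ws.write_row(i, 0, row)
wb.close()
```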

shacker
  • Pandas has always used xlsxwriter by default, which is fine if all you're doing is creating new files. But if memory is likely to be an issue then it is advisable to avoid `to_excel()` entirely and use the libraries directly. – Charlie Clark Jul 22 '20 at 11:33
  • In the pandas v1.3.0 documentation, `engine='openpyxl'` is the default for reading files. And in a recent experiment conducted on 21 March 2022, explicitly assigning `engine='xlsxwriter'` did indeed improve memory management when writing. – Azhar Mar 21 '22 at 04:05