I am successfully writing dataframes to Excel using `df.to_excel()`. Unfortunately, this is slow and consumes gobs of memory. The larger the dataset, the more memory it consumes, until (with the largest datasets I need to deal with) the server starves for resources.
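For context, the write itself is just the stock call; a minimal reproduction (the column names and sizes here are made up):

```python
import pandas as pd

# a small stand-in dataframe; the real ones are much larger
df = pd.DataFrame({"id": range(50_000), "value": range(50_000)})

# builds the entire workbook object in memory before anything hits disk
df.to_excel("report.xlsx", index=False)
```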
I found that the `df.to_csv()` method instead offers the convenient `chunksize=nnnn` argument. This is far faster and consumes almost no extra memory. Wonderful! I'll just write initially to .csv, then convert the .csv to .xlsx in a separate process. I tried doing the conversion with openpyxl, but every solution I found iterated through the csv data one row at a time, appending to a Workbook sheet, e.g.:
import csv, openpyxl

wb = openpyxl.Workbook()
ws = wb.active
with open(temporary_filepath, 'r') as f:
    for row in csv.reader(f):
        ws.append(row)
wb.save(output_filepath)
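For reference, the CSV half of the plan is the easy part; a minimal sketch (filenames are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"id": range(100_000), "value": range(100_000)})

# chunksize writes the frame in batches of rows, so peak memory
# stays roughly flat regardless of how large the frame is
df.to_csv("temporary.csv", index=False, chunksize=10_000)
```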
This works, but when I watch my resource monitor, it consumes just as much memory and is just as slow (I now assume the original `df.to_excel()` was doing the same thing internally). So this approach didn't get me out of the woods after all.
I had assumed I could do something equivalent to File | Save As, but in Python: read the whole csv into an openpyxl Workbook and save it to a file all in one go, without iterating. Either this is not possible or I can't find the documentation for it.
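To be concrete, the one-shot conversion I was picturing would look something like this (a sketch of the intent only, not a low-memory solution, since it still materializes everything at once):

```python
import pandas as pd

# a stand-in for the csv produced earlier in the pipeline
pd.DataFrame({"id": [1, 2, 3]}).to_csv("data.csv", index=False)

# the "File | Save As" I was hoping for: one read, one write, no manual
# row loop -- but this still holds the whole dataset in memory at once
pd.read_csv("data.csv").to_excel("final.xlsx", index=False)
```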
Given a very large Pandas dataframe and a requirement to output .xlsx (not .csv), what is the best approach for low memory consumption? Can it be done efficiently with Pandas or Openpyxl, or is there a better tool for the job?
Update: Looks like pyexcel has a Save As method that might do the trick. I would prefer not to add yet another spreadsheet lib to the stack if possible, but will do so if there is no equivalent in pandas or openpyxl. Has anyone used that successfully?