3

I have a list of lists (sub-lists) where each sub-list is a set of character strings. The list came from a text file of data with consecutive sample data aligned vertically. Here is a sample:

""
"Test Method","Generic Stuff.msm"
"Sample I. D.","sed do eiusmod tempor incididunt ut labore et.mss"
"Specimen Number","1"

"Load (lbf)","Time (s)","Extension (in)"

48.081,3.150,0.000
77.307,3.200,0.000
98.159,3.250,0.000
53.256,3.300,0.000
42.476,3.350,0.000
67.080,3.400,0.000
17.786,3.450,0.000
82.600,3.500,0.001
50.644,3.550,0.001
97.122,3.600,0.001
\n
\n

All samples are separated by a pair of double quotation marks (`""`) and end with a couple of newlines.

Pandas does a great job of allowing me to convert those characters into floats or leave them as strings. So I decided to put the sub-lists through a for-loop, adjust where the data actually starts (the same headers exist for each specimen), and try to add each sample to its own SHEET in a single Excel WORKBOOK.

Here is the code:

source_file = r'input.txt'
base = os.path.splitext(source_file)[0]
excel_file = base + ".xlsx"

workbook = openpyxl.Workbook()
initial_work_sheet = workbook.active
initial_work_sheet.title = 'Create WorkBook'
initial_work_sheet['A1'] = "Do With This Sheet As You Please"
workbook.save(excel_file)  # openpyxl opens and writes the file itself

with open(source_file, 'r') as file:
    data = file.read().split('""')
    data = [i.split('\n') for i in data]
    data.remove([''])
    for i in np.arange(len(data)):
        data[i] = list(filter(None, data[i]))

source_WB = openpyxl.load_workbook(excel_file)

for sub_data in data:
    sub_data = [s.split(',') for s in sub_data][2:]
    df = pd.DataFrame(sub_data[2:], columns=sub_data[1])
    df['"Load (lbf)"'] = df['"Load (lbf)"'].astype(float)
    df['"Time (s)"'] = df['"Time (s)"'].astype(float)
    df['"Extension (in)"'] = df['"Extension (in)"'].astype(float)
    
    source_WB.create_sheet(' '.join(sub_data[0]))
    print("Writing ", ' '.join(sub_data[0]))

    if ' '.join(sub_data[0]) in source_WB.sheetnames:
        ws = source_WB[' '.join(sub_data[0])]
    else:
        ws = source_WB.active
    for r in dataframe_to_rows(df, index=False, header=True):
        ws.append(r)
    source_WB.save(excel_file)

Initially, I tried to load the entire workbook every time it looped, but that proved to consume too much memory. I then ended up with the optimized modes of openpyxl, but even then it slows down significantly. The "data" list is about 30,000+ lines long with 3 columns.

Using the optimized mode of openpyxl, I was hoping to copy data from a "read_only" Excel file to a "write_only" Excel file, as recommended in the docs. I then save the "write_only" Excel file over the "read_only" file, which acts as the source for the next loop iteration.
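
For reference, the copy pattern from the docs looks roughly like this (a minimal, self-contained sketch; the file names are placeholders):

```python
import openpyxl

# Build a tiny source workbook so the sketch is self-contained.
src_path = "source.xlsx"
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Sample"
ws.append(["Load (lbf)", "Time (s)", "Extension (in)"])
ws.append([48.081, 3.150, 0.000])
wb.save(src_path)

# Copy every sheet from a read-only workbook into a write-only one.
src = openpyxl.load_workbook(src_path, read_only=True)
dst = openpyxl.Workbook(write_only=True)
for sheet in src.worksheets:
    new_ws = dst.create_sheet(title=sheet.title)
    for row in sheet.iter_rows(values_only=True):  # tuples of values, no cell objects
        new_ws.append(row)
src.close()
dst.save("copy.xlsx")
```

Note that if the whole workbook is re-copied once per sample, the total work grows quadratically with the number of samples, which would match the slowdown described above.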

Aside from the significant slowdown after the tenth sample or so, all data ends up in a single column for every sample except the last, which has its data in 3 columns, as intended.

I've searched as much as I know how to ask search engines, but still can't find a good fit for what I am doing.

I have a lot of data and I need to get it into Excel quickly. How do I do that? Thanks.

  • 2
    The cause of your slowdown is the triple nested for loops giving you a Big O(n**3) -- also known as Real Bad. You want to refactor this code, preferably using pandas broadcasting. If you'd like more, consider adding in a little sample data. – hrokr Jun 20 '21 at 03:55
  • @hrokr, I added a sample snippet. Does this help? – SuperUser_Vermeylen Jun 20 '21 at 10:00
  • 1
    The triple nested loop is unavoidable but also not that bad, use `values_only` for a slight performance boost. But I don't see the need to include the loops for the worksheet within the dataframe loop, this will needlessly slow things down. – Charlie Clark Jun 21 '21 at 11:17
  • @CharlieClark, wouldn't I have to copy it over when I add sheets? – SuperUser_Vermeylen Jun 21 '21 at 13:20
  • I don't know exactly what you're doing but it seems to me that you don't need to copy the sheets more than once, so this shouldn't be in a loop. Otherwise you're creating and saving the same file multiple times. – Charlie Clark Jun 21 '21 at 13:27
  • @CharlieClark, Every new dataframe needs its own sheet in the workbook. I open the work book as "read_only," open a "write_only" workbook, copy everything from the "read_only" workbook to the "write_only" workbook, append a dataframe to the "write_only", then save over the "read_only" workbook with the "write_only work." I reload the new workbook and begin on the next dataframe. Is that what is supposed to happen in the optimized modes? – SuperUser_Vermeylen Jun 21 '21 at 15:20
  • But if you only need one workbook, why do you nest everything in `for sub_data in data`? – Charlie Clark Jun 21 '21 at 15:24
  • @CharlieClark, I apologize for any ambiguity. My code will hopefully state what I am doing better than I. I updated the parts that create a workbook (just for me to understand better what is happening) and added where I grab all data from a text file. As far as I can tell I have to loop through the "list of lists" that is my delimited data file and add each one individually to a worksheet in the workbook. I'm new to python so it is difficult to speak technically. – SuperUser_Vermeylen Jun 21 '21 at 15:56
  • @SuperUser_Vermeylen - The snippet does help but I still have a few questions: 1)Why are you going with adding "each sample to its own SHEET in a single Excel WORKBOOK." if the data columns are all the same? (like, is this an actual requirement or something you think was the best choice?) 2) Why are you reading the data using a `with open` statement as opposed to `pandas.read_excel`? Also, 30k rows of 3 cols isn't all that much. – hrokr Jun 21 '21 at 19:52
  • @hrokr, Each sample has its own set of 3 data columns, so I want each sample in its own excel SHEET, not WORKBOOK. It makes sense to put each sample in its own sheet, no? The data isn't read originally from an excel file. It originally comes from a .txt file. The "with" statement was just something new. Is there a way I can clarify this in the question? – SuperUser_Vermeylen Jun 21 '21 at 19:59
  • @SuperUser_Vermeylen - if each sheet is location or time based, then yes. But, there is nothing in the "Load (lbf)","Time (s)","Extension (in)" that would lead us to think so. -- gotta flight so I'll be offline for a few hours – hrokr Jun 21 '21 at 20:02
  • @hrokr, I'm very confused as to what you recommend. I'm trying to put "Specimen 1" with all its data in its own sheet, "Specimen 2" with all its data in its own sheet, "Specimen 3", etc. Putting them in one sheet would be too much as there are 60+ specimens each with their own data. Are you confusing "worksheet" and "workbook?" Thanks for your time and attention, BTW. – SuperUser_Vermeylen Jun 21 '21 at 20:19
  • @Sergey Solod answered a question here about writing a pandas dataframe to an existing workbook. It works very well, but it begins slowing down after the ~30th dataframe. How do I properly use the optimized methods of openpyxl to make the number of dataframes irrelevant? – SuperUser_Vermeylen Jun 22 '21 at 17:59

2 Answers

0

I figured it out! Answer is below. I know this was probably plastered everywhere and I simply didn't understand, but this is able to write my dataframes to an excel workbook, each in its own sheet, very rapidly. No excessive memory consumption.

import os
import numpy as np
import pandas as pd
import openpyxl
from openpyxl.utils.dataframe import dataframe_to_rows

print("openpyxl version: ", openpyxl.__version__)

source_file = r'input.txt'

base = os.path.splitext(source_file)[0]
excel_file = base + ".xlsx"

with open(source_file, 'r') as file:
    data = file.read().split('""')
    data = [i.split('\n') for i in data]
    data.remove([''])
    for i in np.arange(len(data)):
        data[i] = list(filter(None, data[i]))

source_WB = openpyxl.Workbook(write_only=True)

for sub_data in data:
    sub_data = [s.split(',') for s in sub_data][2:]
    df = pd.DataFrame(sub_data[2:], columns=sub_data[1])
    df['"Load (lbf)"'] = df['"Load (lbf)"'].astype(float)
    df['"Time (s)"'] = df['"Time (s)"'].astype(float)
    df['"Extension (in)"'] = df['"Extension (in)"'].astype(float)

    source_WB.create_sheet(title=' '.join(sub_data[0]))
    print("Writing ", ' '.join(sub_data[0]))

    ws = source_WB[' '.join(sub_data[0])]
    for r in dataframe_to_rows(df, index=False, header=True):
        ws.append(r)

source_WB.save(excel_file)
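
As a sanity check on output like this, `pd.read_excel` with `sheet_name=None` returns a dict of one DataFrame per sheet. A minimal sketch using a stand-in file (the file and sheet names here are made up):

```python
import pandas as pd
import openpyxl

# Build a small write-only workbook with one sheet per specimen,
# standing in for the excel_file produced above.
demo_file = "demo.xlsx"
wb = openpyxl.Workbook(write_only=True)
for name in ["Specimen Number 1", "Specimen Number 2"]:
    ws = wb.create_sheet(title=name)
    ws.append(["Load (lbf)", "Time (s)", "Extension (in)"])
    ws.append([48.081, 3.150, 0.000])
wb.save(demo_file)

# sheet_name=None -> {sheet name: DataFrame} for every sheet
sheets = pd.read_excel(demo_file, sheet_name=None)
print(sorted(sheets))
```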
0

I think the approach will work, but pandas has a built-in Excel writer function that's pretty quick.

%%timeit

with pd.ExcelWriter('path_to_file.xlsx', mode='w') as writer:
    df_1.to_excel(writer, sheet_name='Sheet1')
    df_2.to_excel(writer, sheet_name='Sheet2')
    df_3.to_excel(writer, sheet_name='Sheet3')
    df_4.to_excel(writer, sheet_name='Sheet4')
    df_5.to_excel(writer, sheet_name='Sheet5')

The results are pretty respectable.

One thing to note: since you mentioned it bogged down, I tested the above with 5 dataframes of 500 rows and three columns each.
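
For anyone reproducing the timing, the fake data can be set up along these lines (a sketch; the sizes and sheet names are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = ["Load (lbf)", "Time (s)", "Extension (in)"]

# Five fake dataframes of 500 rows x 3 columns each.
dfs = {f"Sheet{i}": pd.DataFrame(rng.random((500, 3)), columns=cols)
       for i in range(1, 6)}

# Write each dataframe to its own sheet in one pass.
with pd.ExcelWriter("path_to_file.xlsx", mode="w") as writer:
    for name, df in dfs.items():
        df.to_excel(writer, sheet_name=name, index=False)
```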

hrokr
  • I posted the answer I was looking for below. It runs through everything in under 1 sec, so no better than yours, but I also don't have dataframes at the ready like your code does. They start out as a sub-list and then get converted to a dataframe. I appreciate your attention, though. – SuperUser_Vermeylen Jun 22 '21 at 23:41
  • 1
    That's what I did too. I just faked up the data. But there is more than one way to approach a problem. I do hope that you at least consider not going with the O(n**3) approach next time. – hrokr Jun 23 '21 at 00:01
  • As a novice coder, I promise to make every mistake that makes more experienced coders weep. Cheers. – SuperUser_Vermeylen Jun 23 '21 at 00:36