I'm trying to build a new DataFrame, which will be converted into a SQLite3 database, using data from three different DataFrames read from three SQLite3 files.
The idea is that, for each item in UNTRMT, I have to find which items in UCBT are related to it, and for each item in UCBT, which CRVCRG item is associated with it. Then, I do some math (basically accumulating the quantities) with the UCBT items and append the results to lists, saving the values for each UNTRMT item. Once all the UNTRMT items have been processed, I use those lists to build a new DataFrame and a SQLite3 database, which will be used in further steps of my project.
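To make the relationships concrete, here is a toy sketch of how the three tables link together (the table contents are made up; the column names `uni_tr_mt`, `tip_cc`, `cod_id` and `tip_dia` are the ones used in my code below):

```python
import pandas as pd

# Toy versions of the three tables (contents invented for illustration;
# the real ones come from query_untrmt, query_ucbt and query_crvcrg)
untrmt = pd.DataFrame({'cod_id': ['T1', 'T2']})
ucbt = pd.DataFrame({
    'uni_tr_mt': ['T1', 'T1', 'T2'],   # links each UCBT row to one UNTRMT item
    'tip_cc':    ['C1', 'C2', 'C1'],   # links each UCBT row to one CRVCRG curve
})
crvcrg = pd.DataFrame({'cod_id': ['C1', 'C2'], 'tip_dia': ['DU', 'DU']})

# For one UNTRMT item: its UCBT rows, and the CRVCRG curves those rows use
ucbt_t1 = ucbt[ucbt['uni_tr_mt'] == 'T1']
curves = crvcrg[crvcrg['cod_id'].isin(ucbt_t1['tip_cc'])]
```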
To illustrate the size of the problem: the UNTRMT database has around 200,000 rows and 10 columns, the UCBT database has more than 4.5 million rows with 5 columns, and the CRVCRG database has 200 rows with 101 columns.
My code works correctly and runs in a reasonable amount of time on a sample: around 15 minutes for 3,500 UNTRMT items, measured with time.time(), which means around 0.25 seconds per item. However, when I run it on the complete databases it takes more than 10 hours, which I think is too long. The code below shows only the DU case, but the complete code handles DU, SA and DO, three types of day that are all processed in exactly the same way (the complete code is essentially this same code repeated three times for the different types of day).
# Number of DU in the month - pandas, from a .csv file
num_du = data[m].loc['du']
# SQLite3 database read into DataFrames
crvcrg = query_crvcrg(conn_crvcrg, m)
untrmt, index_untrmt = query_untrmt(conn_bdgd)
ucbt = query_ucbt(conn_bdgd, m)
# Aux lists to build DataFrame
aux_list = [[] for _ in range(4)]
# For each item in UNTRMT
for i in index_untrmt:
    # Slicing the dataframe seems to be faster than what I have tried...
    ucbt_0 = ucbt.loc[ucbt['uni_tr_mt'] == i]
    index_ucbt = ucbt_0.index.values
    ncons = len(index_ucbt)
    dic_total = ucbt_0[dic[m]].sum()
    fic_total = ucbt_0[fic[m]].sum()
    # Initializing the accumulators
    cc_acc_du = np.zeros(96)
    kWh_du_total = 0
    # For each item in UCBT
    for j in index_ucbt:
        # Identifying parameters
        uc = ucbt_0.loc[j]
        # Identifying tip_cc (to search in the CRVCRG database)
        tip_cc = uc['tip_cc']
        # Searching the CRVCRG database
        cc_norm = crvcrg.loc[crvcrg['cod_id'] == tip_cc]
        # Only for DU (I have DU, SA and DO, three types of day)
        cc_norm_du = cc_norm.loc[cc_norm['tip_dia'] == 'DU']
        cc_kW_du = nota_tecnica(cc_norm_du, num_du, uc[ene[m]])
        kWh_du = cc_kW_du.sum()*num_du
        # Accumulating the quantities
        cc_acc_du = cc_acc_du + cc_kW_du
        kWh_du_total += kWh_du
    # Saving the data for each UNTRMT element in lists to build the DataFrame at the end
    aux_list[0].append(i)
    aux_list[1].append(ncons)
    aux_list[2].append(kWh_du_total)
    aux_list[3].append(cc_acc_du)
# Building the Dataframe - It is working fine :)
untrmt = pd.concat([base_du(untrmt, aux_list, indice)])
# Dataframe to a new SQLite3 database
conn = sqlite3.connect('Updated_DataBase.sqlite')
untrmt.to_sql(name='UNTRMT_' + system + '_' + m, con=conn, if_exists='replace', index=True)
conn.commit()
conn.close()
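In case it helps to reproduce the behavior, here is a minimal self-contained version of the same loop structure on toy data (the table contents, the flat 96-point curves and the simplified nota_tecnica are made up for illustration; the real data comes from the SQLite files):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real tables (column names as in my code; sizes reduced)
untrmt_ids = ['T1', 'T2']
ucbt = pd.DataFrame({
    'uni_tr_mt': ['T1', 'T1', 'T2'],
    'tip_cc':    ['C1', 'C2', 'C1'],
    'ene':       [10.0, 20.0, 30.0],   # stands in for uc[ene[m]]
})
curve_cols = [f'h{k}' for k in range(96)]   # the 96 curve points
crvcrg = pd.DataFrame({'cod_id': ['C1', 'C2'], 'tip_dia': ['DU', 'DU'],
                       **{c: [1.0 / 96, 1.0 / 96] for c in curve_cols}})
num_du = 21   # DU days in the month

def nota_tecnica(cc_norm_du, num_du, energy):
    # Simplified stand-in: scale the normalized curve by the monthly energy
    return cc_norm_du[curve_cols].iloc[0].to_numpy() * energy / num_du

aux_list = [[] for _ in range(4)]
for i in untrmt_ids:
    ucbt_0 = ucbt.loc[ucbt['uni_tr_mt'] == i]
    cc_acc_du = np.zeros(96)
    kWh_du_total = 0.0
    for j in ucbt_0.index:
        uc = ucbt_0.loc[j]
        cc_norm = crvcrg.loc[crvcrg['cod_id'] == uc['tip_cc']]
        cc_norm_du = cc_norm.loc[cc_norm['tip_dia'] == 'DU']
        cc_kW_du = nota_tecnica(cc_norm_du, num_du, uc['ene'])
        cc_acc_du = cc_acc_du + cc_kW_du
        kWh_du_total += cc_kW_du.sum() * num_du
    aux_list[0].append(i)
    aux_list[1].append(len(ucbt_0))
    aux_list[2].append(kWh_du_total)
    aux_list[3].append(cc_acc_du)

result = pd.DataFrame({'uni_tr_mt': aux_list[0], 'ncons': aux_list[1],
                       'kWh_du': aux_list[2]})
```

With flat normalized curves, each consumer's accumulated kWh equals its monthly energy, so T1 accumulates 10 + 20 and T2 accumulates 30.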
Does anyone know how to do this procedure with more efficient structures/functions in order to reduce the computation time?
Best regards!