using pandas.melt for big dataframe

Question

I am using at some point pd.melt to reshape my dataframe. This command after inspections is taking around 7min to run which is too long in my use case (I am using it in an interactive dashboard).

I am asking if there are any methods on how to improve running time of melt function via pandas.
If not, is it possible and a good practice to use a big data package just for this line of code?

pd.melt(change_t, id_vars=['id', 'date'], value_vars=factors, value_name='value')
factors=list of 20 columns

How many rows does your 20 column table have? – mitoRibo Jun 07 '22 at 18:18 — mitoRibo, Jun 07 '22 at 18:18

score 0 · Answer 1 · answered Jun 07 '22 at 18:40

I've timed melting a test table with 2 id_vars, 20 factors, and 1M rows and it took 22 seconds on my laptop. Is your table similarly sized, or much much larger? If it is a huge table, would it be ok to return only part of the melted output to your interactive dashboard? I put some code for that approach and it took 1.3 seconds to return the first 1000 rows of the melted table.

Timing melting a large test table

import pandas as pd
import numpy as np
import time

id_cols = ['id','date']
n_ids = 1000
n_dates = 100
n_cols = 20
n_rows = 1000000

#Create the test table
df = pd.DataFrame({
    'id':np.random.randint(1,n_ids+1,n_rows),
    'date':np.random.randint(1,n_dates+1,n_rows),
})

factors = []
for c in range(n_cols):
    c_name = 'C{}'.format(c)
    factors.append(c_name)
    df[c_name] = np.random.random(n_rows)
    
#Melt and time how long it takes
start = time.time()
pd.melt(df, id_vars=['id', 'date'], value_vars=factors, value_name='value')
print('Melting took',time.time()-start,'seconds for',n_rows,'rows')
#Melting took 21.744 seconds for 1000000 rows

Here's a way you can get just the first 1000 melted rows

ret_rows = 1000

start = time.time()
partial_melt_df = pd.DataFrame()
for ks,g in df.groupby(['id','date']):
    g_melt = pd.melt(g, id_vars=['id', 'date'], value_vars=factors, value_name='value')
    partial_melt_df = pd.concat((partial_melt_df,g_melt), ignore_index=True)
    
    if len(partial_melt_df) >= ret_rows:
        partial_melt_df = partial_melt_df.head(ret_rows)
        break
    
print('Partial melting took',time.time()-start,'seconds to give back',ret_rows,'rows')
#Partial melting took 1.298 seconds to give back 1000 rows

using pandas.melt for big dataframe

1 Answers1