0

I have a dataframe that is imported from csv/excel file. The file is a simulation data that sweeps variables. Different versions of the file sweep different variables. Sometimes, it sweeps variables a,c,e, other times, it sweeps variables b,c,d, .etc.

Currently, when slicing the interactive dataframe idf (idf = df.interactive()), I manually change the code, depending on the prior knowledge of what is swept.(slicing on all the columns in one step with "&" operator.)The execution time of slicing is 7.67 seconds in the example below. However, if I want to detect what variables are swept and slice the dataframe interactively, the execution speed increases a lot. (382 seconds.)

I wonder whether there is a syntax that enjoy both the fast speed and automatic detection of what variables are swept.

import pandas as pd
import numpy as np
import time 
import panel as pn
pn.extension('tabulator')
import hvplot.pandas
import holoviews as hv

list_alphabet = ['a','b','c','d','e']
list_year = [2000,2001,2002,2003,2004,2005]
list_country = ['US','BE','NL','CN']
list_color = ['white','blue','red','green','black']
list_fruit = ['apple','banana','pear','peach']
list_price = ['high','medium','low']
list_currency = ['USD','EUR','CNY','CAD']
list_transport = ['walking','biking','car','bus','train','tram','plane']
list_age = [10,20,30,40,50,60,70]

dict_lst ={'alphabet':list_alphabet,'year':list_year,'country':list_country,'color':list_color,'fruit':list_fruit,'price':list_price,\
       'currency':list_currency,'transport':list_transport,'age':list_age}

list_itertools = [dict_lst[x] for x in dict_lst.keys()]
df = pd.DataFrame(list(itertools.product(*list_itertools)), columns=[x for x in dict_lst.keys()])

df['value'] = df['year']+10*df['age']

if 'data' not in pn.state.cache.keys():
    pn.state.cache['data'] = df.copy()
else: 
    df = pn.state.cache['data']

dict_selector = {}
for selector_name in lst_possible_swept_variables:
    if selector_name in df.columns:
        selector = pn.widgets.Select(name=selector_name, options=dict_lst[selector_name])
        dict_selector[selector_name] = selector

idf = df.interactive()

# fast-execution but hard-coded  
start_time = time.time()
df_pipeline =  (
idf[   
    (idf.alphabet == dict_selector['alphabet']) & \
    (idf.year == dict_selector['year']) & \
    (idf.country == dict_selector['country']) & \
    (idf.color == dict_selector['color']) & \
    (idf.fruit == dict_selector['fruit']) & \
    (idf.price == dict_selector['price']) & \
    (idf.currency == dict_selector['currency']) & \
    (idf.transport == dict_selector['transport']) & \
    (idf.age == dict_selector['age']) \
].groupby(['alphabet','year','country','color','fruit','price','currency','transport','age'])
    ['value'].mean().to_frame().reset_index().sort_values(by='age').reset_index(drop=True)
)
print("--- %s seconds ---" % (time.time() - start_time))
## --- 7.60701847076416 seconds ---

# self-adjusting code but executed much slower
start_time = time.time()
idf_slice = idf
for key_name in dict_selector.keys():
    idf_slice = idf_slice[idf_slice[key_name] == dict_selector[key_name]]
df_pipeline_slice = (idf_slice.groupby([x for x in dict_selector.keys()])['value'].mean().to_frame().reset_index().sort_values(by='age').reset_index(drop=True)
)
print("--- %s seconds ---" % (time.time() - start_time))
## --- 381.84749722480774 seconds ---
John
  • 348
  • 3
  • 15

0 Answers0