
How can I properly use dask delayed for a group-wise quotient calculation over multiple columns?

Some sample data:

import pandas as pd

raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'name': ['A', 'B', 'C', 'D', 'E'],
        'nationality': ['DE', 'AUT', 'US', 'US', 'US'],
        'alotdifferent': ['x', 'y', 'z', 'x', 'a'],
        'target': [0, 0, 0, 1, 1],
        'age_group': [1, 2, 1, 3, 1]}
df_a = pd.DataFrame(raw_data, columns=['subject_id', 'name', 'nationality', 'alotdifferent', 'target', 'age_group'])
df_a['nationality'] = df_a['nationality'].astype('category')
df_a['alotdifferent'] = df_a['alotdifferent'].astype('category')
df_a['name'] = df_a['name'].astype('category')

Some setup code which determines the categorical columns:

FACTOR_FIELDS = df_a.select_dtypes(include=['category']).columns
columnsToDrop = ['alotdifferent']
columnsToBias_keep = FACTOR_FIELDS[~FACTOR_FIELDS.isin(columnsToDrop)]
target = 'target'

The main part: the calculation of the group-wise quotients:

def compute_weights(da, colname):
    # group only a single time
    grouped = da.groupby([colname, target]).size()
    # ratio of each (category, target) count to the total number of positives
    ratios = grouped / da[target].sum()
    nameCol = "pre_" + colname
    grouped_res = ratios.reset_index(name=nameCol)
    grouped_res = grouped_res[grouped_res[target] == 1]
    grouped_res = grouped_res.drop(columns=[target])
    # TODO: persist the result in a dict for the transformer
    return grouped_res, nameCol
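To make the function concrete, here is a minimal self-contained run on a single column; `observed=False` and `drop(columns=...)` are assumptions to keep newer pandas versions happy, and the trimmed-down data matches the sample above:

```python
import pandas as pd

raw_data = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'nationality': ['DE', 'AUT', 'US', 'US', 'US'],
    'target': [0, 0, 0, 1, 1]}
df_a = pd.DataFrame(raw_data)
df_a['nationality'] = df_a['nationality'].astype('category')

target = 'target'

def compute_weights(da, colname):
    # count rows per (category, target) combination
    grouped = da.groupby([colname, target], observed=False).size()
    # ratio relative to the total number of positive targets (here: 2)
    ratios = grouped / da[target].sum()
    nameCol = "pre_" + colname
    grouped_res = ratios.reset_index(name=nameCol)
    grouped_res = grouped_res[grouped_res[target] == 1]
    grouped_res = grouped_res.drop(columns=[target])
    return grouped_res, nameCol

result, name = compute_weights(df_a, 'nationality')
# name == 'pre_nationality'; both positive rows are 'US', so its ratio is 1.0
```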

And now actually calling it on multiple columns:

original = df_a.copy()
output_df = original
ratio_weights = {}

for colname in columnsToBias_keep.union(columnsToDrop):
    result_1, nameCol = compute_weights(original, colname)

    # persist the result in a dict for the transformer
    # this is required to separate the fit and transform stages (later on in a sklearn transformer)
    ratio_weights[nameCol] = result_1

When trying to use dask delayed, I need to call compute for each column, which breaks the DAG. How can I circumvent this in order to create a single big computational graph which is calculated in parallel?

compute_weights = delayed(compute_weights)
delayed_res_name = compute_weights(original, colname)
a, b = delayed_res_name.compute()  # calling compute per column breaks the graph
ratio_weights = {}
ratio_weights[b] = a
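A minimal sketch of one way around this, assuming `dask` is installed: keep every per-column result lazy, collect the `Delayed` objects in a list, and call `dask.compute` exactly once so the whole graph runs in a single parallel pass (the names `delayed_weights` and `tasks` are illustrative, and `observed=False` is an assumption for newer pandas):

```python
import pandas as pd
import dask
from dask import delayed

raw_data = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'name': ['A', 'B', 'C', 'D', 'E'],
    'nationality': ['DE', 'AUT', 'US', 'US', 'US'],
    'alotdifferent': ['x', 'y', 'z', 'x', 'a'],
    'target': [0, 0, 0, 1, 1],
    'age_group': [1, 2, 1, 3, 1]}
df_a = pd.DataFrame(raw_data)
for col in ['name', 'nationality', 'alotdifferent']:
    df_a[col] = df_a[col].astype('category')

target = 'target'

def compute_weights(da, colname):
    # count rows per (category, target) combination
    grouped = da.groupby([colname, target], observed=False).size()
    # ratio relative to the total number of positive targets
    ratios = grouped / da[target].sum()
    nameCol = "pre_" + colname
    grouped_res = ratios.reset_index(name=nameCol)
    grouped_res = grouped_res[grouped_res[target] == 1]
    grouped_res = grouped_res.drop(columns=[target])
    return grouped_res, nameCol

# build the whole graph lazily: no compute() inside the loop
delayed_weights = delayed(compute_weights)
tasks = [delayed_weights(df_a, c) for c in ['name', 'nationality', 'alotdifferent']]

# a single compute() call executes all tasks in one parallel pass
results = dask.compute(*tasks)
ratio_weights = {name: frame for frame, name in results}
```

Each element of `results` is the `(frame, nameCol)` tuple the function returns, so the dictionary can be filled after the single `compute` instead of once per column.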
Georg Heiler
    I suspect that if you are able to reduce your problem to a [mcve](https://stackoverflow.com/help/mcve) that you will get an answer more quickly. Generally this means stripping out all of the details specific to your domain and focusing only on API questions, perhaps with a toy example with a minimum number of lines of code. – MRocklin May 09 '17 at 23:17

0 Answers