0

I am working on a problem to group rows by using cumulative sum of an attribute (after being ordered). But I am new to python and don't know how to process it. Please kindly advise. Any help is appreciated.

Here is my input, which is a panda dataframe I made. As you can see, neither the key nor the group is ordered.

key  group v1  v2
1_A   1    22  4
1_A  -1    10  11
1_B   2    15  9
1_B   6    15  2
1_A   2    33  43
1_A   5    50  22
1_A   3    5   122
1_B   1    30  8
1_A   4    1   2

For the data processing, I need to calculate cumulative sum of v1 order by group and it is for the rows with the same key. So I suppose I should order the table first. But I am not sure. Please suggest. If I need to order the table first, the new table is like this one shown below. Basically, I am putting rows with the same key together, and order these rows by group.

key  group v1  v2
1_A  -1    10  11
1_A   1    22  4
1_A   2    33  43
1_A   3    5   122
1_A   4    1   2
1_A   5    50  22
1_B   1    30  8
1_B   2    15  9
1_B   6    1   2

Here is my desired output. Mainly I need to do cumulative sum of v1 by order of group, once the cumulative sum reaches a threshold, say 30 here, the accumulation stops, and restart for the next row. This process keeps going until it reaches the last row of the same group. At the end, if the last bin(s) is smaller than 30, then combine them with the lower bin(s), which is shown in 1_B, where group 2 and 6 together makes only 16 (<30), so they need to be combined with group 1.

Note that the bin number can be different from what I have here. As long as it is giving the same bin number to the same group, it works. For example, you can totally replace 1,2,3 to A,B,C, or to 3,2,1, or to A100,B201,M434.

key  group v1  v2    bin       sum_v1    sum_v2
1_A  -1    10  11    1         32        15
1_A   1    22  4     1         32        15
1_A   2    33  43    2         33        43
1_A   3    5   122   3         56        146
1_A   4    1   2     3         56        146
1_A   5    50  22    3         56        146
1_B   1    30  8     1         46        19
1_B   2    15  9     1         46        19
1_B   6    1   2     1         46        19

EDIT: Now I have a full solution posted below as an answer. Enjoy.

Counter10000
  • 525
  • 1
  • 8
  • 25

1 Answers1

1

I created a solution. I was bogged by the full job, but once I realized that it can be broken into small jobs, I was able to solve these smaller tasks one at a time. The process is not tough. The planning is the hard part. So now I am sharing my result with everyone, in case someone has the same puzzle (I already noticed the two booking stars meaning someone's interested). Voila!


import pandas as pd
data = [['1_A',1, 22, 4],['1_A', -1, 10, 11 ],['1_B',2, 15, 9],['1_B',6, 1, 2],['1_A',2, 33, 43 ],['1_A',5, 50, 22 ],['1_A',3, 5 , 122],['1_B',1, 30, 8],['1_A',4, 1 , 2]] 
df_1 = pd.DataFrame(data, columns = ['key', 'group', 'v1', 'v2'])
df_2 = df_1.sort(['key', 'group'])
def f1(df, thresh):
    myList = [] 
    bin = 0     
    sum_v1 = 0     
    sum_v2 = 0   
    new_df = pd.DataFrame(columns = ['key', 'group', 'v1', 'v2', 'sum_v1', 'sum_v2', 'bin']) 
    for i, (key, group, v1, v2) in df.iterrows(): 
        if key not in myList:
            myList.append(key) 
            bin = 1
            sum_v1 = v1
            sum_v2 = v2
        else:
            if sum_v1 < thresh:
                bin += 0
                sum_v1 += v1
                sum_v2 += v2
            else:
                bin += 1
                sum_v1 = v1
                sum_v2 = v2
        new_df.loc[i, ['key']] = key
        new_df.loc[i, ['group']] = group
        new_df.loc[i, ['v1']] = v1
        new_df.loc[i, ['v2']] = v2
        new_df.loc[i, ['sum_v1']] = sum_v1
        new_df.loc[i, ['sum_v2']] = sum_v2
        new_df.loc[i, ['bin']] = bin
    return new_df

new_df_2 = f1(df_2, 30)
df_3 = new_df_2.groupby(['key', 'bin']).agg({'v1': "sum", 'v2': "sum"}).reset_index()
df_3.rename(columns={'v2': 'a_c_sum_v2', 'v1': 'a_c_sum_v1'}, inplace=True)
def f2(df, thresh):
    df_tmp = df.sort(['key', 'bin'], ascending=[1, 0]) 
    myList = [] 
    bin_d = 0 
    sum_v1_d = 0   
    sum_v2_d = 0  
    new_df = pd.DataFrame(columns = ['key', 'bin', 'a_c_sum_v1', 'a_c_sum_v2', 'sum_v1_d', 'sum_v2_d', 'bin_d']) 
    for i, (key, bin, v1, v2) in df_tmp.iterrows(): 
        if key not in myList:
            myList.append(key) 
            bin_d = 1
            sum_v1_d = v1
            sum_v2_d = v2
        else:
            if sum_v1_d < thresh:
                bin_d += 0
                sum_v1_d += v1
                sum_v2_d += v2
            else:
                bin_d += 1
                sum_v1_d = v1
                sum_v2_d = v2
        new_df.loc[i, ['key']] = key
        new_df.loc[i, ['bin']] = bin
        new_df.loc[i, ['a_c_sum_v1']] = v1
        new_df.loc[i, ['a_c_sum_v2']] = v2
        new_df.loc[i, ['sum_v1_d']] = sum_v1_d
        new_df.loc[i, ['sum_v2_d']] = sum_v2_d
        new_df.loc[i, ['bin_d']] = bin_d
    return new_df

new_df_3 = f2(df_3, 30)
df_4 = new_df_3.groupby(['key', 'bin_d']).agg({'a_c_sum_v1': "sum", 'a_c_sum_v2': "sum"}).reset_index()
df_4.rename(columns={'a_c_sum_v2': 'sum_v2', 'a_c_sum_v1': 'sum_v1'}, inplace=True)
m_1 = pd.merge(new_df_3[['key', 'bin', 'bin_d']], df_4[['key', 'bin_d', 'sum_v1', 'sum_v2']], left_on=['key', 'bin_d'], right_on=['key', 'bin_d'], how='left')
m_2 = pd.merge(new_df_2[['key', 'group', 'bin']], m_1[['key', 'bin', 'bin_d', 'sum_v1', 'sum_v2']], left_on=['key', 'bin'], right_on=['key', 'bin'], how='left')
m_3 = pd.merge(df_1[['key', 'group', 'v1', 'v2']], m_2[['key', 'group', 'bin_d', 'sum_v1', 'sum_v2']], left_on=['key', 'group'], right_on=['key', 'group'], how='left')
m_3.rename(columns={'bin_d': 'bin'}, inplace=True)
m_3.sort(['key', 'group'])
Counter10000
  • 525
  • 1
  • 8
  • 25