How to conditionally sum columns pairwise on an asymmetric dataset?

Question

Here is a simplified version of my problem:

import pandas as pd

v1 = [1, 5, 0]
v2, v2_tocomb = [2, 4, 9], [0, 1, 0]
v3, v3_tocomb = [3, 3, 8], [2, 0, 1]
v4, v4_tocomb = [0, 7, 6], [1, 1, 1]
v5 = [2, 2, 6]
v1_1 = [3, 4, 1]
v2_1, v2_tocomb_1 = [4, 2, 8], [2, 1, 0]
v3_1, v3_tocomb_1 = [1, 1, 7], [1, 1, 0]
v4_1 = [0, 0, 5]
v5_1 = [9, 5, 5]

data = pd.DataFrame({'v1': v1, 'v2': v2, 'v2_tocomb': v2_tocomb, 'v3': v3, 'v3_tocomb': v3_tocomb, 'v4': v4, 'v4_tocomb': v4_tocomb, 'v5': v5,
                    'v1_1': v1_1, 'v2_1': v2_1, 'v2_tocomb_1': v2_tocomb_1, 'v3_1': v3_1, 'v3_tocomb_1': v3_tocomb_1, 'v4_1': v4_1, 'v5_1': v5_1})

# what to do? #

data_final = pd.DataFrame({'v1': v1, 'v2_comb': [sum(x) for x in zip(v2, v2_tocomb)], 'v3_comb': [sum(x) for x in zip(v3, v3_tocomb)], 'v4_comb': [sum(x) for x in zip(v4, v4_tocomb)], 'v5': v5,
                           'v1_1': v1_1, 'v2_comb_1': [sum(x) for x in zip(v2_1, v2_tocomb_1)], 'v3_comb_1': [sum(x) for x in zip(v3_1, v3_tocomb_1)], 'v4_1': v4_1, 'v5_1': v5_1})

This is the starting dataset (data):

v1	v2	v2_tocomb	v3	v3_tocomb	v4	v4_tocomb	v5	v1_1	v2_1	v2_tocomb_1	v3_1	v3_tocomb_1	v4_1	v5_1
1	2	0	3	2	0	1	2	3	4	2	1	1	0	9
5	4	1	3	0	7	1	2	4	2	1	1	1	0	5
0	9	0	8	1	6	1	6	1	8	0	7	0	5	5

And this is the desired output (data_final):

v1	v2_comb	v3_comb	v4_comb	v5	v1_1	v2_comb_1	v3_comb_1	v4_1	v5_1
1	2	5	1	2	3	6	2	0	9
5	5	3	8	2	4	3	2	0	5
0	9	9	7	6	1	8	7	5	5

The main task is to sum values pairwise: each column that has _tocomb in its name is added to its corresponding base column. For example, columns v2 and v2_tocomb need to be summed into the new column v2_comb (then v2 and v2_tocomb are dropped). The same happens for columns ending with _1; for example, v3_1 and v3_tocomb_1 are combined into the new column v3_comb_1.

I can get data_final for this simplified version of the dataset by doing the described process "by hand", but the original problem refers to 200+ columns, so I need to find a way to automate the process. Furthermore, the original features' names aren't that simple (v1, v2, and so on), but it shouldn't be a problem.

In my opinion, what makes this problem tricky is that the columns to combine can end with either _tocomb or _tocomb_1, and that there is no symmetry: a column doesn't necessarily have a partner to be summed with, and the columns ending with _1 don't follow the same pattern as the "original" ones (here v4 has a v4_tocomb partner, but v4_1 has none).

Currently I am using only the pandas library, but I could use any other Python library to achieve the desired output. Can anyone help me?

mozway · Accepted Answer · 2023-02-22T15:11:03.390


You can remove the _tocomb part from the column names, group the columns by the resulting names, and use groupby.sum:

grp = data.columns.str.replace('_tocomb', '', regex=True)

data_final = data.groupby(grp, axis=1, sort=False).sum()

Output:

   v1  v2  v3  v4  v5  v1_1  v2_1  v3_1  v4_1  v5_1
0   1   2   5   1   2     3     6     2     0     9
1   5   5   3   8   2     4     3     2     0     5
2   0   9   9   7   6     1     8     7     5     5
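On recent pandas releases, `groupby(..., axis=1)` is deprecated (since 2.1). A minimal sketch of the same grouping that avoids `axis=1`, assuming the sample `data` from the question: transpose, group the rows, and transpose back. Passing the group labels as a NumPy array via `to_numpy()` is my choice here, not part of the original answer.

```python
import pandas as pd

# Sample frame from the question
data = pd.DataFrame({
    'v1': [1, 5, 0], 'v2': [2, 4, 9], 'v2_tocomb': [0, 1, 0],
    'v3': [3, 3, 8], 'v3_tocomb': [2, 0, 1],
    'v4': [0, 7, 6], 'v4_tocomb': [1, 1, 1], 'v5': [2, 2, 6],
    'v1_1': [3, 4, 1], 'v2_1': [4, 2, 8], 'v2_tocomb_1': [2, 1, 0],
    'v3_1': [1, 1, 7], 'v3_tocomb_1': [1, 1, 0],
    'v4_1': [0, 0, 5], 'v5_1': [9, 5, 5],
})

# One group label per column: the column name with '_tocomb' stripped
grp = data.columns.str.replace('_tocomb', '', regex=False)

# Transpose so the columns become rows, sum the rows within each group
# (sort=False keeps first-appearance order), then transpose back
data_final = data.T.groupby(grp.to_numpy(), sort=False).sum().T

print(data_final)
```

Transposing copies the data and can change dtypes when the columns have mixed types, so this sketch fits all-numeric frames like the one here best.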

Then, if needed, adjust the column names with rename:

cols = {c.replace('_tocomb', ''): c.replace('_tocomb', '_comb')
        for c in data.columns.difference(data_final.columns)}
data_final = data_final.rename(columns=cols)

Output:

   v1  v2_comb  v3_comb  v4_comb  v5  v1_1  v2_comb_1  v3_comb_1  v4_1  v5_1
0   1        2        5        1   2     3          6          2     0     9
1   5        5        3        8   2     4          3          2     0     5
2   0        9        9        7   6     1          8          7     5     5

edited Feb 22 '23 at 15:11

answered Feb 22 '23 at 13:16

Thank you so much, both for editing my question and for your clear answer; it was really helpful! I just added these lines at the end to drop duplicated columns and get back to the original order (columns ending with `_1` to the right): `data_final = data_final.iloc[:, ~data_final.columns.duplicated()]`, `cols_to_move = data_final.filter(regex='_1$').columns`, `data_final = data_final.drop(columns=cols_to_move).join(data_final[cols_to_move])` – AngelMarcos Feb 22 '23 at 14:41

@Angel, I forgot to add `sort=False`; see the update – mozway Feb 22 '23 at 15:10
You're right, I added it and was able to get rid of the last two lines of code. Thank you! – AngelMarcos Feb 22 '23 at 15:41
