Pandas: Swapping specific column values within one Dataframe and calculate its weighted averages

Question

There exist the following dataframe:

year	pop0	pop1	city0	city1
2019	20	40	Malibu	NYC
2018	8	60	Sydney	Dublin
2018	36	23	NYC	Malibu
2020	17	44	Malibu	NYC
2019	5	55	Sydney	Dublin

I would like to calculate the weighted average for the population of each city pair as a new column. For example, the w_mean for Malibu / NYC = (23+20+17)/(36+40+44) = 0.5.

Following is the desired output:

year	pop0	pop1	city0	city1	w_mean
2018	23	36	Malibu	NYC	0.5
2019	20	40	Malibu	NYC	0.5
2020	17	44	Malibu	NYC	0.5
2018	8	60	Sydney	Dublin	0.113
2019	5	55	Sydney	Dublin	0.113

I already sorted the dataframe by its columns, but I have issues swapping the 3rd row from NYC/Malibu to Malibu/NYC with its populations. Besides that, I can only calculate the w_mean for each row but not for each group. I tried groupby().mean() but didn't get any useful output.

Current code:

import pandas as pd

data = pd.DataFrame({'year': ["2019", "2018", "2018", "2020", "2019"], 'pop0': [20,8,36,17,5], 'pop1': [40,60,23,44,55], 'city0': ['Malibu','Sydney','NYC','Malibu','Sydney'], 'city1': ['NYC','Dublin','Malibu','NYC','Dublin']})

new = data.sort_values(by=['city0', 'city1'])
new['w_mean'] = new.apply(lambda row: row.pop0 / row.pop1, axis=1)
print(new)

How do you determine which city is city0 and which is city1? For example, if Sydney and Dublin switched places then the w_mean would be 8.846. — Shaido, Jan 07 '21 at 02:17
That is true. The other possibility would be to just sum up the values with the same city pair and to insert two new columns (sum_pop0 and sum_pop1) — Royal.Flush, Jan 07 '21 at 04:46

score 0 · Accepted Answer · answered Jan 07 '21 at 06:19

What you can do is creating a creating tuples of (city, population), put the two tuples in a row into a list and then sort it. By doing this for all rows, you can extract the new cities and populations (sorted alphabetically by city). This can be done as follows:

cities = [sorted([(e[0], e[1]), (e[2], e[3])]) for e in data[['city0','pop0','city1','pop1']].values]
data[['city0', 'pop0']] = [e[0] for e in cities]
data[['city1', 'pop1']] = [e[1] for e in cities]

Resulting dataframe:

    year  pop0  pop1  city0    city1
0   2019    20    40  Malibu     NYC
1   2018    60     8  Dublin  Sydney
2   2018    23    36  Malibu     NYC
3   2020    17    44  Malibu     NYC
4   2019    55     5  Dublin  Sydney

Now, the mean_w column can be created using groupby and transform to create the two sums and then divide as follows:

data[['pop0_sum', 'pop1_sum']] = data.groupby(['city0', 'city1'])[['pop0', 'pop1']].transform('sum')
data['w_mean'] = data['pop0_sum'] / data['pop1_sum']

Result:

    year  pop0  pop1   city0   city1  pop0_sum  pop1_sum    w_mean
0   2019    20    40  Malibu     NYC        60       120  0.500000
1   2018    60     8  Dublin  Sydney       115        13  8.846154
2   2018    23    36  Malibu     NYC        60       120  0.500000
3   2020    17    44  Malibu     NYC        60       120  0.500000
4   2019    55     5  Dublin  Sydney       115        13  8.846154

Any extra columns can now be dropped.

If the resulting w_mean column should always be less than zero, then the last division can be done as follows instead:

data['w_mean'] = np.where(data['pop0_sum'] > data['pop1_sum'], data['pop1_sum'] / data['pop0_sum'], data['pop0_sum'] / data['pop1_sum'])

This will give 0.5 for the Malibu & NYC pair and 0.113043 for Dublin & Sydney.

Pandas: Swapping specific column values within one Dataframe and calculate its weighted averages

1 Answers1