2

I have this dataframe and I want to transform it into another dataframe with a column which combines observations from several columns in the first dataframe and aggregates values from the column "points". Here's the dataframe and below is the desired result:

player_data = pd.DataFrame({"customer_id": ["100001", "100002", "100005", "100006", "100007", "100011", "100012", 
                                            "100013", "100022", "100023", "100025", "100028", "100029", "100030"],
                            "country": ["Austria", "Germany", "Germany", "Sweden", "Sweden", "Austria", "Sweden", 
                                        "Austria", "Germany", "Germany", "Austria", "Austria", "Germany", "Austria"],
                            "category": ["basic", "pro", "basic", "advanced", "pro", "intermidiate", "pro", 
                                         "basic", "intermidiate", "intermidiate", "advanced", "basic", "intermidiate", "basic"],
                            "gender": ["male", "male", "female", "female", "female", "male", "female",
                                       "female", "male", "male", "female", "male", "male", "male"],
                            "age_group": ["20", "30", "20", "30", "40", "20", "40",
                                          "20", "30", "30", "40", "20", "30", "20"],
                            "points": [200, 480, 180, 330, 440, 240, 520, 180, 320, 300, 320, 200, 280, 180]})

The new dataframe is supposed to look like this:

Thank you all!

3 Answers3

2

Would this be what you are looking for?

df_new = df.groupby(['country', 'category', 'gender', 'age_group'])['points'].agg('sum').reset_index()
df_new.pivot_table(values = 'points', index = ['country', 'category', 'gender'], columns = 'age_group', fill_value = 0).reset_index().sort_values(['country', 'category', 'gender'])

However, this will not have any columns that have only 0s for example Australia | Advanced | M will not be in here since there wasn't any mention in it for the original df. If you wanted to dynamically add them you might need to rethink the structure of your df.

ArchAngelPwn
  • 2,891
  • 1
  • 4
  • 17
  • Yes, this result is similar and I managed to combine the three columns into one. Now I'm trying to create a visualization - a heatmap in particular, but doesn't seem to work. – Yordanka Stefanova Aug 24 '22 at 15:14
  • Great I'm glad I was able to help! If you liked it please upvote and hit answered so others can find a solution to a similar question. – ArchAngelPwn Aug 25 '22 at 02:45
  • I marked it as "answered" and rated it +1, but since I'm new on this platform my votes still don't show - I get a message "Thanks for the feedback! You need at least 15 reputation to cast a vote, but your feedback has been recorded." As soon as my votes start to count I'll cast my vote again. Thank you! – Yordanka Stefanova Aug 25 '22 at 15:12
2

Try this:

midx = pd.MultiIndex.from_product([player_data['country'].unique(), 
                                   player_data['category'].unique(), 
                                   player_data['gender'].unique()])
player_data.groupby(['country', 'category', 'gender', 'age_group'])['points']\
           .sum()\
           .unstack(fill_value=0)\
           .reindex(midx, fill_value=0)

Output:

age_group                     20   30   40
Austria basic        male    580    0    0
                     female  180    0    0
        pro          male      0    0    0
                     female    0    0    0
        advanced     male      0    0    0
                     female    0    0  320
        intermidiate male    240    0    0
                     female    0    0    0
Germany basic        male      0    0    0
                     female  180    0    0
        pro          male      0  480    0
                     female    0    0    0
        advanced     male      0    0    0
                     female    0    0    0
        intermidiate male      0  900    0
                     female    0    0    0
Sweden  basic        male      0    0    0
                     female    0    0    0
        pro          male      0    0    0
                     female    0    0  960
        advanced     male      0    0    0
                     female    0  330    0
        intermidiate male      0    0    0
                     female    0    0    0
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
2

This works. Although the loops are a pretty janky way of sorting the zeros.

df = player_data.groupby(["country", "category", "gender", "age_group"]).points.sum().reset_index()
df = df.pivot_table(values='points', index=['country', 'category', 'gender'], columns='age_group', fill_value=0)


for country in player_data.country.unique():
    for category in player_data.category.unique():
        for gender in player_data.gender.unique():
            if (country, category, gender) not in df.index:
                df.loc[(country, category, gender)] = np.zeros(len(player_data.age_group.unique()), dtype=int)

df = df.sort_values(['country', 'category', 'gender']).reset_index()

Output:

age_group  country      category  gender   20   30   40
0          Austria      advanced  female    0    0  320
1          Austria      advanced    male    0    0    0
2          Austria         basic  female  180    0    0
3          Austria         basic    male  580    0    0
4          Austria  intermidiate  female    0    0    0
5          Austria  intermidiate    male  240    0    0
6          Austria           pro  female    0    0    0
7          Austria           pro    male    0    0    0
8          Germany      advanced  female    0    0    0
9          Germany      advanced    male    0    0    0
10         Germany         basic  female  180    0    0
11         Germany         basic    male    0    0    0
12         Germany  intermidiate  female    0    0    0
13         Germany  intermidiate    male    0  900    0
14         Germany           pro  female    0    0    0
15         Germany           pro    male    0  480    0
16          Sweden      advanced  female    0  3...