How to summarise data by percentages in pandas

Question

This code:

  #Missing analysis for actions - which action is missing the most action_types?
    grouped_missing_analysis = pd.crosstab(clean_sessions.action_type, clean_sessions.action, margins=True).unstack()
    grouped_unknown = grouped_missing_analysis.loc(axis=0)[slice(None), ['Missing', 'Unknown', 'Other']]
    print(grouped_unknown)

Leads to the printing of this:

action                        action_type
10                            Missing              0
                              Unknown              0
11                            Missing              0
                              Unknown              0
12                            Missing              0
                              Unknown              0
15                            Missing              0
                              Unknown              0
about_us                      Missing              0
                              Unknown            416
accept_decline                Missing              0
                              Unknown              0
account                       Missing              0
                              Unknown           9040
acculynk_bin_check_failed     Missing              0
                              Unknown              1
acculynk_bin_check_success    Missing              0
                              Unknown             51
acculynk_load_pin_pad         Missing              0
                              Unknown             50

How would I now aggregate the total Missing, Unknown and Other for each action as a total value count for each action, and have as a percentage of All action_types which are Missing, Unknown or Other? So for example, there would be one row for each action, and about_us row would have 406+0/Total Missing + Unknown + Other for all actions.

See this question for context.

The problem is that the above contains a row right at the bottom of it called All which is the sum of everything, so:

All                           Missing        1126204
                              Unknown        1031170

Desired output would be:

action                        percent_total_missing_action_type
10                            0
11                            0
12                            0
15                            0
about_us                     416/total_missing_action_type (in the All row - 2157374, or the sum of everything in the action_type column)
accept_decline                0
account                       9040/total_missing_action_type (in the All row - 2157374, or the sum of everything in the action_type column)
acculynk_bin_check_failed     1/total_missing_action_type (in the All row - 2157374, or the sum of everything in the action_type column)
etc..

Here is some test data:

action                        action_type
    a                            Missing              2
                                 Unknown              5
    b                            Missing              3
                                 Unknown              4
    c                            Missing              5
                                 Unknown              6
    d                            Missing              1
                                 Unknown              9
    All                          Missing             11
                                 Unknown             24

Which should go into this:

     action                        action_type_percentage
    a                            Missing              2/11
                                 Unknown              5/24
    b                            Missing              3/11
                                 Unknown              4/24
    c                            Missing              5/11
                                 Unknown              6/24
    d                            Missing              1/11
                                 Unknown              9/24
    All                          Missing             11/11
                                 Unknown             24/24

Maybe help `print grouped_unknown / grouped_unknown.groupby(level=0).transform(sum)` or maybe `print grouped_unknown * 100 / grouped_unknown.groupby(level=0).transform(sum)` — jezrael, Feb 16 '16 at 11:40
I tried this before, this does the percentages _for each action_ not overall, so about us shows 0 for `Missing` and 100 for `Unknown`, instead of maybe showing 0.003, the percentage of `about_us` Missing and Unknown as a percentage of total Missing and Unknown in the whole data frame. — Dhruv Ghulati, Feb 16 '16 at 11:49
I dont understand - you need remove row `All`? Then you can remove `margins=True` from `pd.crosstab(clean_sessions.action_type, clean_sessions.action, margins=True)` — jezrael, Feb 16 '16 at 12:10
Sorry @Dhruv Ghulati, but I create for you test data and I want desired output of this sample data. Instead ot them you create output of your real data. But it is more problematic works with this real data , because I dont have your real data. But if you want, you can add sample desired output of this data and I try to help you: `clean_sessions = pd.DataFrame({'user_id': {0: 'd1', 1: 'd1'}, 'action_type': {0: 'a', 1: 'b'}, 'secs_elapsed': {0: 319, 1: 67753}, 'device_type': {0: 'w', 1: 'w'}, 'action_detail': {0: 'b', 1: 'c'}, 'action': {0: 'b', 1: 'c'}}) ` — jezrael, Feb 19 '16 at 08:45
I think always is the best approach use small test data with 5 - 10 rows, find solution and then use this solution for your real data. Testing is easier and verifying. — jezrael, Feb 19 '16 at 08:49

jezrael · Accepted Answer · 2016-02-19T12:44:10.353

First you can find value of Multindex with key All by xs and then you can try it by original Series. Last you can reset_index:

print df
action  action_type
a       Missing         2
        Unknown         5
b       Missing         3
        Unknown         4
c       Missing         5
        Unknown         6
d       Missing         1
        Unknown         9
All     Missing        11
        Unknown        24
dtype: int64

print df.xs('All')
Missing    11
Unknown    24
dtype: int64
action  action_type

print df / df.xs('All')
action  action_type
a       Missing        0.181818
        Unknown        0.208333
b       Missing        0.272727
        Unknown        0.166667
c       Missing        0.454545
        Unknown        0.250000
d       Missing        0.090909
        Unknown        0.375000
All     Missing        1.000000
        Unknown        1.000000
dtype: float64

print (df / df.xs('All')).reset_index().rename(columns={0:'action_type_percentage'})
  action action_type  action_type_percentage
0      a     Missing                0.181818
1      a     Unknown                0.208333
2      b     Missing                0.272727
3      b     Unknown                0.166667
4      c     Missing                0.454545
5      c     Unknown                0.250000
6      d     Missing                0.090909
7      d     Unknown                0.375000
8    All     Missing                1.000000
9    All     Unknown                1.000000

This gets me to `views_campaign Missing 0.000000 Unknown 100.000000 views_campaign_rules Missing 0.000000 Unknown 100.000000 webcam_upload Missing 0.000000 Unknown 100.000000 weibo_signup_referral_finish Missing 0.000000 Unknown 100.000000 ` instead of calculating the total missing and unknown as a percentage of total missing and unknown, for each action. See comment abov. — Dhruv Ghulati, Feb 16 '16 at 11:51

How to summarise data by percentages in pandas

1 Answers1