6

I have a Pandas DataFrame with customer refund reasons. It contains these example data rows:

    **case_type**       **claim_type**
1   service             service
2   service             service
3   chargeback          service
4   chargeback          local_charges
5   service             supplier_service
6   chargeback          service
7   chargeback          service
8   chargeback          service
9   chargeback          service
10  chargeback          service
11  service             service_not_used
12  service             service_not_used

I would like to compare the customer's reason with some sort of labeled reason. This is no problem, but I would also like to see the total number of records in a specific group (customer reason).

case_claim_type = df[["case_type", "claim_type"]]
case_claim_type.groupby(by=("case_type", "claim_type"))["case_type"].count()

Which gives me this output, for example:

**case_type**     **claim_type**                 
service           service                         2
                  supplier_service                1
                  service_not_used                2
chargeback        service                         6
                  local_charges                   1

I would also like to have have the sum of the output per case_type. Something like:

**case_type**     **claim_type**                 
service           service                         2
                  supplier_service                1
                  service_not_used                2
                  total:                          5
chargeback        service                         6
                  local_charges                   1
                  total:                          7

It doesn't necessarily has to be in this last output format, a column with the (aggregated) totals per case_type is also fine.

eppe2000
  • 110
  • 1
  • 1
  • 8

2 Answers2

6

You can use:

df = case_claim_type.groupby(by=("case_type", "claim_type"))["case_type"].count()
print (df)
case_type   claim_type      
chargeback  local_charges       1
            service             1
service     service             2
            supplier_service    1
Name: case_type, dtype: int64

You can create new DataFrame by aggregate sum and add MultiIndex by MultiIndex.from_tuples:

df1 = df.sum(level=0)
#same as
#df1 = df.groupby(level=0).sum()
new_cols= list(zip(df1.index.get_level_values(0),['total'] * len(df.index)))
df1.index = pd.MultiIndex.from_tuples(new_cols)
print (df1)
chargeback  total    2
service     total    3
Name: case_type, dtype: int64

Then concat together and last sort_index:

df2 = pd.concat([df,df1]).sort_index()
print (df2)
case_type   claim_type      
chargeback  local_charges       1
            service             1
            total               2
service     service             2
            supplier_service    1
            total               3
Name: case_type, dtype: int64
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks for the solution, this is very intuitive. I've tried this and it works very well. However, I marked the other answer as solution as the answer is a bit more concise. – eppe2000 Feb 21 '18 at 07:59
  • @eppe2000 - If you get 2 nice answers, it is always hard ;) But it is up you what solution use ;) good luck! – jezrael Feb 21 '18 at 08:01
  • Hey @jezrael - Necromancing this a little bit, I know, but I was looking around for an answer close to this one but where the Total can be done for more than just addition; say, means as well. Any clue? – cjcrm Feb 12 '21 at 10:48
  • @cjcrm - Then change `df1 = df.sum(level=0)` to `df1 = df.mean(level=0)` – jezrael Feb 12 '21 at 10:55
  • @jezrael right, but what if my groupby uses, say, groupby(...).agg({'case_type':['sum','mean']), what would be the best way to get the case_type's correct values for each agg function applied? – cjcrm Feb 12 '21 at 10:59
  • @cjcrm - I think then need filtering and for each dataframe use correct aggregate function. But maybe the best is create question with sample data and expected output. – jezrael Feb 12 '21 at 11:03
6

Where:

df = pd.DataFrame({'case_type':['Service']*20+['chargeback']*9,'claim_type':['service']*5+['local_charges']*5+['service_not_used']*5+['supplier_service']*5+['service']*8+['local_charges']})

df_out = df.groupby(by=("case_type", "claim_type"))["case_type"].count()

Let use pd.concat, sum with level parameter, and assign:

(pd.concat([df_out.to_frame(),
           df_out.sum(level=0).to_frame()
                 .assign(claim_type= "total")
                 .set_index('claim_type', append=True)])
  .sort_index())

Output:

                             case_type
case_type  claim_type                 
Service    local_charges             5
           service                   5
           service_not_used          5
           supplier_service          5
           total                    20
chargeback local_charges             1
           service                   8
           total                     9
Scott Boston
  • 147,308
  • 15
  • 139
  • 187