0

I have DataFrame in Python Pandas like below:

data type:

  • ID - int64

  • X1 - int64

  • X2 - int64

  • CH - int64

    ID X1 X2 CP CH
    111 1 0 10-20 1
    222 1 0 10-20 1
    333 0 1 30-40 0
    444 1 1 30-40 1
    555 0 1 30-40 1

And I need to create new column "COL1" answered on question:

  • What percentage of customers had CH = '1' per combination: CP x X1 ='1' and CP x X2 = '1'

So as a result I need something like below:

col_X col_CP CH_perc
X1 10-20 1.00 <- 2 IDs had X1 = '1' and CP = '10-20' and 1 of them had CH = '1', so 2/2 = 1.00
X1 20-30 0 <- non of IDs had X1 = '1' and CP = '20-30'
X1 30-40 1.00 <- 1 ID had X1 = '1' and CP = '30-40' and 1 of them had CH = '1', so 1/1 = 1.00
X1 40-50 0 <- non of IDs had X1 = '1' and CP = '40-50'
X2 10-20 0 <- non of IDs had X2 = '1' and CP = '10-20'
X2 20-30 0 <- non of IDs had X2 = '1' and CP = '20-30'
X2 30-40 0.66** <- 3 IDs had X1 = '1' and CP = '30-40' and 2 of them had CH = '1', so 2/3 = 0.66
X2 40-50 0 <- non of IDs had X2 = '1' and CP = '40-50'

How can I do that in Python Pandas ?

unbik
  • 178
  • 9

1 Answers1

0

Code

cps = ['10-20', '20-30', '30-40', '40-50']

x = df.filter(like='X')
x = x.where(x == 1).mul(df['CH'], axis=0)
x = x.groupby(df['CP']).mean().reindex(cps).fillna(0)

How it works?

  • Filter the X like columns
  • mask the values other than 1
  • multiply by CH to ensure that only values where CH == 1 are considered
  • groupby CP and agg with mean to calculate percentage
  • reindex to ensure all CP's are present

Result

        X1        X2
CP                  
10-20  1.0  0.000000
20-30  0.0  0.000000
30-40  1.0  0.666667
40-50  0.0  0.000000
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53