Condition between duplicated values in a column

Question

Every customer is duplicated when they have more than one plan. I want to set the status to the customer:

If they have every product with 'canceled_at' filled, the customer status is cancelled, but when it's not every product with the canceled_at filled, but at least one, the status is 'downgrade' because he lost a product.

customer|canceled_at|status
x       |3/27/2018  |
x       |           |
y       |2/2/2018   |
y       |2/2/2018   |
z       |1/1/2018   |
a       |           |

I already have the canceled status, now i only need the downgrade

df['status']=(df.groupby('customer')['canceled_at'].
  transform(lambda x: x.notna().all()).map({True:'canceled'})).fillna(df.status)

customer|canceled_at|status
x       |3/27/2018  |downgrade
x       |           |downgrade
y       |2/2/2018   |canceled
y       |2/2/2018   |canceled
z       |1/1/2018   |canceled
a       |           |

jezrael · Accepted Answer · 2019-03-27T11:18:04.800

Here is possible compare column for no missing values and grouping by Series customer with GroupBy.transform and GroupBy.all, GroupBy.any for test all values Trues (all non missing) or at least one value not missing (any non missing) and pass it to numpy.select:

g = df['canceled_at'].notna().groupby(df['customer'])
m1 = g.transform('all')
m2 = g.transform('any')

df['status'] = np.select([m1, m2],['canceled','downgrade'], np.nan)
print (df)
  customer canceled_at     status
0        x   3/27/2018  downgrade
1        x         NaN  downgrade
2        y    2/2/2018   canceled
3        y    2/2/2018   canceled
4        z    1/1/2018   canceled
5        a         NaN        nan

Or:

df['status'] = np.select([m1, m2],['canceled','downgrade'], '')
print (df)
  customer canceled_at     status
0        x   3/27/2018  downgrade
1        x         NaN  downgrade
2        y    2/2/2018   canceled
3        y    2/2/2018   canceled
4        z    1/1/2018   canceled
5        a         NaN

If only NaNs groups need convert to downgrade:

mask = df['canceled_at'].notna().groupby(df['customer']).transform('all')
df['status'] = np.where(mask,'canceled','downgrade')
print (df)
  customer canceled_at     status
0        x   3/27/2018  downgrade
1        x         NaN  downgrade
2        y    2/2/2018   canceled
3        y    2/2/2018   canceled
4        z    1/1/2018   canceled
5        a         NaN  downgrade

score 1 · Answer 2 · answered Mar 27 '19 at 11:26

Here is a way to do it:

import pandas as pd

def select_status(canceled):
    c = canceled.count()
    if c == 0:
        status = ''
    elif c == len(canceled):
        status = 'canceled'
    else:
        status = 'downgrade'
    return pd.Series(status, index=canceled.index)

df = pd.DataFrame({'customer': ['x', 'x', 'y', 'y', 'z', 'a'],
                   'canceled_at': ['3/27/2018', None, '2/2/2018', '2/2/2018', '1/1/2018', None]})
df['status'] = df.groupby('customer')['canceled_at'].apply(select_status)
print(df)

Output:

  customer canceled_at     status
0        x   3/27/2018  downgrade
1        x        None  downgrade
2        y    2/2/2018   canceled
3        y    2/2/2018   canceled
4        z    1/1/2018   canceled
5        a        None

Condition between duplicated values in a column

2 Answers2