1

Creating a column in a dataframe based on multiple conditions based on values from different columns.

The goal is to get an indication of when the first interesting action was of a customer, this will be represented with a 1 under t0.

Dataframe is structured as below:

      cust_id       first_act     prod_1  prod_2   t0
0      1                  1          1              
22     2                                            
23     2                                     1                      
24     2                             1              
25     2                                            
26     3                  1
27     3
28     3
29     4
30     4

I want to assign a value to the column t0 based on the following conditions:

if customer has a 1 under prod_1: assign value 1 to t0 at the index where it has the 1 under prod_1.

if customer does not have a 1 under prod_1, check if customer has a 1 under prod_2 and if true assign t0 the value of 1 at the index where the condition is true.

lastly: if the customer does not have prod_1 or prod_2 but does have a 1 under first_act, assign the value 1 to the index where first act is true, under t0.

After these conditions there should only be one value in t0 for each customer.

Expected output for cust_id 2:

 cust_id       first_act     prod_1  prod_2   t0
0      1            1          1              
22     2            1                                
23     2                               1                      
24     2                       1               1    
25     2                                            
26     3            1
27     3
28     3
29     4
30     4

I tried doing this with nested np.where statements but that did not work as the following:

df['t0'] = np.where(df['prod_1'] == 1, 1 ,
                         np.where(df['prod_2'] == 1, 1,
                                 np.where(df['first_act'] == 1, 1, 0)))

Adds 1's to t0 on multiple locations.

Update

@Jeffyx I don't know if this clears it up a bit but what I thought of was this:

if prod_1 == 1:
    t0 = 1 at index of prod_1 == 1
if not prod_1 == 1:
    if prod_2 == 1:
        t0 = 1 at index of prod_2 == 1
if not prod_1 == 1 and not prod_2 == 1:
    if first_act == 1:
        t0 = 1 at index of first_act == 1
merdy
  • 13
  • 3
  • I'm not going to lie, I'm having a hard time understanding your requirements for ['t0'] to have a 1, but I know a simple way to answer you question is to use [link] (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) pandas loc. Its an easy way to update a column with multiple conditions. – Jeffyx Jun 06 '19 at 12:58
  • 1
    Agree with @Jeffyx -- I read through the requirements and checked to see if my expected output matched yours, and it's not even close. I'll try to look again after your update. – Tim S. Jun 06 '19 at 13:04
  • I understand it could be confusing. Tried rephrasing, does it help? – merdy Jun 06 '19 at 13:09

1 Answers1

0

You must find the first index matching your condition, and then use that index to set a value in the t0 column.

Using a groupby, it gives:

for _, sub in df.groupby(['cust_id']):              # test for each cust_id
    for col in ['prod_1', 'prod_2', 'first_act']:   # test columns in sequence
        tmp = sub[sub[col] == 1]                    # try to match
        if len(tmp) != 0:                           # ok found at least one
            df.loc[tmp.index[0], 't0'] = 1          # set t0 to 1 for first index found
            break
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
  • Thank you, this is perfect. I did not know that you could loop through groups by using the _. – merdy Jun 06 '19 at 13:32
  • Iterating on a `groupby` returns pairs (index, sub_dataframe). The `-` variable is just an indication that I won't use the index here. – Serge Ballesta Jun 06 '19 at 13:44