Pandas/Python: How to create new column based on values from other columns and apply extra condition to this new column

Question

I have a pandas dataframe and I want to create a new column BB based on the below condition.

Create a new column BB, if the values in column TGR1 is 0, assign 0 to BB else,
The value in TGR1 is not 0, look up the columns ('1','2','3') that corresponds with the value in TGR1 assign the value in that column(either '1','2','3') to the new column BB.

I was able to achieve the first step using

df.loc[df['TGR1'] == 0, 'BB'] = 0

I also tried to use np.where to come up with but I can figure out the right way to go about this.

df['BB'] = np.where(df.TGR1 == 0,0, df.columns == test.TGR1.value )
    
    

Dist    Track    EVENT_ID      Date       1      2        3   TGR1 TGR2
                            
311m    Cran    174331755   2020-10-19  34.00   5.18    19.10   1   0
311m    Cran    174331755   2020-10-19  34.00   5.18    19.10   2   1
311m    Cran    174331755   2020-10-19  34.00   5.18    19.10   0   2
311m    Cran    174331755   2020-10-19  34.00   5.18    19.10   3   1
311m    Cran    174331755   2020-10-19  34.00   5.18    19.10   2   2
311m    Cran    174331755   2020-10-19  34.00   5.18    19.10   1   2

Expected Output:

Dist    Track    EVENT_ID      Date       1      2        3   TGR1 TGR2    BB     
                            
311m    Cran    174331755   2020-10-19  34.00   5.18    19.10   1   0     34.00        
311m    Cran    174331755   2020-10-19  34.00   5.18    19.10   2   1     5.18     
311m    Cran    174331755   2020-10-19  34.00   5.18    19.10   0   2       0
311m    Cran    174331755   2020-10-19  34.00   5.18    19.10   3   1     19.10

score 4 · Accepted Answer · answered Sep 26 '21 at 00:49

One way is to use numpy advanced indexing:

import numpy as np
# extract columns 1,2,3 into a numpy array with a zeros column stacked on the left
vals = np.column_stack((np.zeros(len(df)), df[list('123')]))

vals
array([[ 0.  , 34.  ,  5.18, 19.1 ],
       [ 0.  , 34.  ,  5.18, 19.1 ],
       [ 0.  , 34.  ,  5.18, 19.1 ],
       [ 0.  , 34.  ,  5.18, 19.1 ],
       [ 0.  , 34.  ,  5.18, 19.1 ],
       [ 0.  , 34.  ,  5.18, 19.1 ]])

# use TGR1 values as the column index to extract corresponding values
df['BB'] = vals[np.arange(len(df)), df.TGR1.values]

df
   Dist Track   EVENT_ID        Date     1     2     3  TGR1  TGR2     BB
0  311m  Cran  174331755  2020-10-19  34.0  5.18  19.1     1     0  34.00
1  311m  Cran  174331755  2020-10-19  34.0  5.18  19.1     2     1   5.18
2  311m  Cran  174331755  2020-10-19  34.0  5.18  19.1     0     2   0.00
3  311m  Cran  174331755  2020-10-19  34.0  5.18  19.1     3     1  19.10
4  311m  Cran  174331755  2020-10-19  34.0  5.18  19.1     2     2   5.18
5  311m  Cran  174331755  2020-10-19  34.0  5.18  19.1     1     2  34.00

Thanks, @Psidom I accepted your answer because in terms of speed it is faster and can be scaled for multiple columns in a dataframe. — chuky pedro, Sep 26 '21 at 10:47
I get the below error when I tried the above function on my Dataframe `IndexError: arrays used as indices must be of integer (or boolean) type` — chuky pedro, Sep 26 '21 at 13:22
@ChukypedroOkolie it seems your TGR1 column is not of integer type, try convert it first. `vals[np.arange(len(df)), df.TGR1.values.astype(int)]` — Psidom, Sep 26 '21 at 14:22

rpanai · Answer 2 · 2021-09-27T13:05:40.550

Here you can try to play some numpy trick as in this answer.

We first define a matrix with values from columns 1,2 and 3 and add a first column with zeros.

import pandas as pd
import numpy as np

# we first define a matrix 
# with len(df) rows and 4 columns
mat = np.zeros((len(df), 4))

# Then we fill the last 3 columns 
# with values from df
mat[:,1:] = df[["1", "2", "3"]].values

# Then a vector with values from df["TGR1"]
v = df["TGR1"].values


# Finally we take the given index
# from each row on matrix
df["BB"] = np.take_along_axis(mat, v[:,None], axis=1)

Timing

I compared the timing for some of the answers here. I just took a df 10_000 larger than the original one

df_bk = pd.concat([df for i in range(10_000)], ignore_index=True)

and before run each test I do df = df_bk.copy()

@wwnde's solution

CPU times: user 430 ms, sys: 12.1 ms, total: 442 ms
Wall time: 452 ms

@cookesd's solution

CPU times: user 746 ms, sys: 0 ns, total: 746 ms
Wall time: 746 ms

@rpanai's solution

CPU times: user 5.54 ms, sys: 0 ns, total: 5.54 ms
Wall time: 4.84 ms

@Psidom's solution

CPU times: user 5.93 ms, sys: 141 µs, total: 6.07 ms
Wall time: 5.61 ms

Psidom's solution and mine have basically the same timing. Here is a plot

score 1 · Answer 3 · answered Sep 26 '21 at 01:03

You can create the column using a list comprehension with your if-else logic

# Sample data
df = pd.DataFrame({'TGR1':[random.randint(0,3) for i in range(10)],
                   '1':[random.randint(0,100) for i in range(10)],
                   '2':[random.randint(101,200) for i in range(10)],
                   '3':[random.randint(201,300) for i in range(10)]})
# creating the column
df['BB'] = [0 if tgr1_val == 0 else df.loc[ind,str(tgr1_val)]
            for ind,tgr1_val in enumerate(df['TGR1'].values)]

df

#    TGR1   1    2    3   BB
# 0     0  54  107  217    0
# 1     2  71  128  277  128
# 2     1  25  103  269   25
# 3     0  80  112  279    0
# 4     2  98  167  228  167
# 5     3  26  192  285  285
# 6     0  27  107  228    0
# 7     2  13  103  298  103
# 8     3  28  196  289  289
# 9     2  72  186  251  186

score 1 · Answer 4 · answered Sep 26 '21 at 01:16

this is done easy with the use of boolean mask as you did it in your step one:

df['BB'][df['TGR1'] == 0] = 0

for the other values greaters than 0:

df['BB'][df['TGR1'] == 1] = df['1'][df['TGR1'] == 1]
df['BB'][df['TGR1'] == 2] = df['2'][df['TGR1'] == 2]
df['BB'][df['TGR1'] == 3] = df['3'][df['TGR1'] == 3]

output:
    1         2       3   TGR1   BB
0   34.0    5.18    19.1    1   34.00
1   34.0    5.18    19.1    2   5.18
2   34.0    5.18    19.1    0   0.00
3   34.0    5.18    19.1    3   19.10
4   34.0    5.18    19.1    2   5.18

probably it is pretty much readable.

Thanks for your response. I agree with it in terms of readability but it gets longer as the number of columns `('1','2','3')` grows. — chuky pedro, Sep 26 '21 at 10:45

score 0 · Answer 5 · edited Sep 27 '21 at 12:47

Drop TGR2temporarily, do alook up of columns using TGR1 and that should do. code below

s = df.astype(str).drop('TGR2',1).filter(regex='\d', axis=1).reset_index()#Drop TRG2 and filter rows with digits to allow lookup
i = s.astype(str).columns.get_indexer(s.TGR1)#DO alook up to get columns whose values are in TGR1
df['BB'] = s.values[s.index,i]

   Dist Track   EVENT_ID        Date     1     2     3 TGR1 TGR2    BB
0  311m  Cran  174331755  2020-10-19  34.0  5.18  19.1    1    0  34.0
1  311m  Cran  174331755  2020-10-19  34.0  5.18  19.1    2    1  5.18
2  311m  Cran  174331755  2020-10-19  34.0  5.18  19.1    0    2     0
3  311m  Cran  174331755  2020-10-19  34.0  5.18  19.1    3    1  19.1
4  311m  Cran  174331755  2020-10-19  34.0  5.18  19.1    2    2  5.18
5  311m  Cran  174331755  2020-10-19  34.0  5.18  19.1    1    2  34.0