Using if else statements on lambda expressions on a pandas data frame based on column names

Question

I have a data frame df that looks like the following:

import pandas as pd
df = pd.DataFrame({'a':[78.78, 77.26], 'b':[94.47,94.06], 'c':[0.72, 0.71], 'd':[0.19, 0.29]})

For the columns a, b and c I want to extract (into a list) the min values, while for column d I want to get the max value i.e. :

[77.26, 94.06, 0.71, 0.29]

I am mainly trying to get this done with lambda expressions

to get all the min values, for instance, I could:

df.apply(lambda x:x.min(), axis = 0)

I thought about something like (of course it is not working):

df_final.apply(lambda x:x.max() if x =='d' else x.min(), axis = 0)

I have found this question which is doing something similar, though the conditional statement is based on the values of each columns, while I want my if else statement based on the column names.The thing is that x i.e. the iterable object is not the column names. How can I then apply if else conditions with lambda functions based on column names?

Does this answer your question? [Apply function to each cell in DataFrame that depends on the column name in pandas](https://stackoverflow.com/questions/44938029/apply-function-to-each-cell-in-dataframe-that-depends-on-the-column-name-in-pand) — rdmolony, Oct 12 '21 at 13:46

score 6 · Answer 1 · answered Nov 06 '18 at 15:18

6

Use numpy.where:

a = np.where(df.columns == 'd', df.max(), df.min()).tolist()
print (a)
[77.26, 94.06, 0.71, 0.29]

answered Nov 06 '18 at 15:18

jezrael

822,522
95
1,334
1,252

score 3 · Answer 2 · answered Nov 06 '18 at 15:19

In general, try not to use apply because it does not perform vectorized operations (i.e. it is slow).

Here, you can just select the columns you want and sum the lists

min_cols = ['a', 'b', 'c']
max_cols = ['d']

>>> df[min_cols].min().tolist() + df[max_cols].max().tolist()
[77.26, 94.06, 0.71, 0.29]

score 2 · Answer 3 · edited Nov 06 '18 at 15:20

2

Use agg with dictionary:

df.agg({'a':'min','b':'min','c':'min','d':'max'}).tolist()

Output:

[77.26, 94.06, 0.71, 0.29]

edited Nov 06 '18 at 15:20

BENY

317,841
20
164
234

answered Nov 06 '18 at 15:19

Scott Boston

147,308
15
139
187

Psidom · Accepted Answer · 2018-11-06T16:39:15.100

You can use the name attribute on the Series:

df.apply(lambda x: x.max() if x.name == 'd' else x.min())
#a    77.26
#b    94.06
#c     0.71
#d     0.29
#dtype: float64

Naive Timings for your reference assuming you don't have a lot of columns:

Small data frame:

df = pd.DataFrame({'a':[78.78, 77.26], 'b':[94.47,94.06], 'c':[0.72, 0.71], 'd':[0.19, 0.29]})
    
%timeit df.apply(lambda x: x.max() if x.name == 'd' else x.min()).tolist()
# 770 µs ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.np.where(df.columns == 'd', df.max(), df.min()).tolist()
# 268 µs ± 7.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.agg({'a':'min','b':'min','c':'min','d':'max'}).tolist()
# 814 µs ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df[min_cols].min().tolist() + df[max_cols].max().tolist()
# 1.02 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.describe().loc['min','a':'c'].tolist()+df.describe().loc['max',['d']].tolist()
# 18.7 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Large data frame:

df = pd.DataFrame({'a':[78.78, 77.26], 'b':[94.47,94.06], 'c':[0.72, 0.71], 'd':[0.19, 0.29]})

df = pd.concat([df] * 10000)

%timeit df.apply(lambda x: x.max() if x.name == 'd' else x.min()).tolist()
# 1.03 ms ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.np.where(df.columns == 'd', df.max(), df.min()).tolist()
#1.81 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.agg({'a':'min','b':'min','c':'min','d':'max'}).tolist()
# 1.07 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df[min_cols].min().tolist() + df[max_cols].max().tolist()
# 1.9 ms ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.describe().loc['min','a':'c'].tolist()+df.describe().loc['max',['d']].tolist()
# 25.7 ms ± 752 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Just a caveat: the timings are misleading in that for both "large data frame" and "small data frame" there are only 4 columns. If you had, for example, the same code, but `df` was `df.T`, `apply` would perform incredibly worse while other solutions would keep similar time — rafaelc, Nov 06 '18 at 16:33
@RafaelC Yes. This does assume there are not a lot of columns. Maybe it's better to say there are practical cases if not a lot, the `apply` method is perfect fine. — Psidom, Nov 06 '18 at 16:38

score 0 · Answer 5 · answered Nov 06 '18 at 15:23

0

Using describe

df.describe().loc['min','a':'c'].tolist()+df.describe().loc['max',['d']].tolist()
Out[276]: [77.26, 94.06, 0.71, 0.29]

answered Nov 06 '18 at 15:23

BENY

317,841
20
164
234

Using if else statements on lambda expressions on a pandas data frame based on column names

5 Answers5