2

I have a data frame df that looks like the following:

import pandas as pd
df = pd.DataFrame({'a':[78.78, 77.26], 'b':[94.47,94.06], 'c':[0.72, 0.71], 'd':[0.19, 0.29]})

For the columns a, b and c I want to extract (into a list) the min values, while for column d I want to get the max value i.e. :

[77.26, 94.06, 0.71, 0.29]

I am mainly trying to get this done with lambda expressions

to get all the min values, for instance, I could:

df.apply(lambda x:x.min(), axis = 0)

I thought about something like (of course it is not working):

df_final.apply(lambda x:x.max() if x =='d' else x.min(), axis = 0)

I have found this question which is doing something similar, though the conditional statement is based on the values of each columns, while I want my if else statement based on the column names.The thing is that x i.e. the iterable object is not the column names. How can I then apply if else conditions with lambda functions based on column names?

rafaelc
  • 57,686
  • 15
  • 58
  • 82
BCArg
  • 2,094
  • 2
  • 19
  • 37
  • Does this answer your question? [Apply function to each cell in DataFrame that depends on the column name in pandas](https://stackoverflow.com/questions/44938029/apply-function-to-each-cell-in-dataframe-that-depends-on-the-column-name-in-pand) – rdmolony Oct 12 '21 at 13:46

5 Answers5

6

Use numpy.where:

a = np.where(df.columns == 'd', df.max(), df.min()).tolist()
print (a)
[77.26, 94.06, 0.71, 0.29]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
3

In general, try not to use apply because it does not perform vectorized operations (i.e. it is slow).

Here, you can just select the columns you want and sum the lists

min_cols = ['a', 'b', 'c']
max_cols = ['d']

>>> df[min_cols].min().tolist() + df[max_cols].max().tolist()
[77.26, 94.06, 0.71, 0.29]
rafaelc
  • 57,686
  • 15
  • 58
  • 82
2

Use agg with dictionary:

df.agg({'a':'min','b':'min','c':'min','d':'max'}).tolist()

Output:

[77.26, 94.06, 0.71, 0.29]
BENY
  • 317,841
  • 20
  • 164
  • 234
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
1

You can use the name attribute on the Series:

df.apply(lambda x: x.max() if x.name == 'd' else x.min())
#a    77.26
#b    94.06
#c     0.71
#d     0.29
#dtype: float64

Naive Timings for your reference assuming you don't have a lot of columns:

Small data frame:

df = pd.DataFrame({'a':[78.78, 77.26], 'b':[94.47,94.06], 'c':[0.72, 0.71], 'd':[0.19, 0.29]})
​    
%timeit df.apply(lambda x: x.max() if x.name == 'd' else x.min()).tolist()
# 770 µs ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.np.where(df.columns == 'd', df.max(), df.min()).tolist()
# 268 µs ± 7.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.agg({'a':'min','b':'min','c':'min','d':'max'}).tolist()
# 814 µs ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df[min_cols].min().tolist() + df[max_cols].max().tolist()
# 1.02 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.describe().loc['min','a':'c'].tolist()+df.describe().loc['max',['d']].tolist()
# 18.7 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Large data frame:

df = pd.DataFrame({'a':[78.78, 77.26], 'b':[94.47,94.06], 'c':[0.72, 0.71], 'd':[0.19, 0.29]})
​
df = pd.concat([df] * 10000)

%timeit df.apply(lambda x: x.max() if x.name == 'd' else x.min()).tolist()
# 1.03 ms ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.np.where(df.columns == 'd', df.max(), df.min()).tolist()
#1.81 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.agg({'a':'min','b':'min','c':'min','d':'max'}).tolist()
# 1.07 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df[min_cols].min().tolist() + df[max_cols].max().tolist()
# 1.9 ms ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.describe().loc['min','a':'c'].tolist()+df.describe().loc['max',['d']].tolist()
# 25.7 ms ± 752 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • 1
    Just a caveat: the timings are misleading in that for both "large data frame" and "small data frame" there are only 4 columns. If you had, for example, the same code, but `df` was `df.T`, `apply` would perform incredibly worse while other solutions would keep similar time – rafaelc Nov 06 '18 at 16:33
  • @RafaelC Yes. This does assume there are not a lot of columns. Maybe it's better to say there are practical cases if not a lot, the `apply` method is perfect fine. – Psidom Nov 06 '18 at 16:38
0

Using describe

df.describe().loc['min','a':'c'].tolist()+df.describe().loc['max',['d']].tolist()
Out[276]: [77.26, 94.06, 0.71, 0.29]
BENY
  • 317,841
  • 20
  • 164
  • 234