Dynamic comparison of values of n multiple Pandas columns

Question

Let's say a user can input the columns and values to compare for a DF, so we can have:

column_list = ['col1', 'col2', 'col3']
value_list = [val1, val2, val3]

So to select the rows that satisfy where col1 >= val1 AND col2 >= val2 AND col3 >= val3 we would write:

selection = (df['col1'] >= val1) & (df['col2'] >= val2) & (df['col3'] >= val3))

or it can be in the form:

selection  = df.loc[(df['col1'] >= val1) & (df['col2'] >= val2) & (df['col3'] >= val3)]

The number of columns is not known in advance, so we can have n columns. We can try this approach:

if n=1:
   selection = (df['col1'] >= val1))
elif n=2:
   selection = (df['col1'] >= val1) & (df['col2'] >= val2))
elif n=3:
   selection = (df['col1'] >= val1) & (df['col2'] >= val2) & (df['col3'] >= val3))

But this is neither scalable nor efficient. I tried by generating strings df['col<>'] >= val<>) with a for loop given the input lists but it didn't work for Pandas because of the str format.

What would be the best pythonic approach for this? To avoid having all the options with if and else statements.

OP, can you please confirm whether or not you mean to be mixing different comparison operators? — ddejohn, Sep 23 '21 at 08:09

mozway · Accepted Answer · 2021-09-23T07:49:59.573

2

To perform a comparison with the same operator for all columns, create a Series with the values and columns ids and use it to perform an aligned comparison with the dataframe:

df[df.gt(pd.Series(value_list, index=column_list)).all(1)]

Example input:

>>> value_list
[3, 7, 11]
>>> df
   col1  col2  col3
0     0     1     2
1     3     4     5
2     6     7     8
3     9    10    11
4    12    13    14

output:

   col1  col2  col3
4    12    13    14

intermediates:

>>> pd.Series(value_list, index=column_list)
col1     3
col2     7
col3    11

>>> df.gt(pd.Series(value_list, index=column_list))
    col1   col2   col3
0  False  False  False
1  False  False  False
2   True  False  False
3   True   True  False
4   True   True   True

>>> df.gt(pd.Series(value_list, index=column_list)).all(1)
0    False
1    False
2    False
3    False
4     True

edited Sep 23 '21 at 07:49

answered Sep 23 '21 at 07:44

mozway

194,879
13
39
75

Special emphasis on ***with the same operator***. Nice solution if this is what OP ends up needing. – ddejohn Sep 23 '21 at 08:21
it's simple and it solved the question as all the operators were the same, thank you! [mozway](https://stackoverflow.com/users/16343464/mozway) – juanman Sep 23 '21 at 14:00
1

@juanman note that since you want `>=` you should use `ge` instead of `gt`. – ddejohn Sep 23 '21 at 16:54

ddejohn · Answer 2 · 2021-09-23T08:19:38.737

With a df like this,

And values, columns, and arbitrary operators,

In [2]: import operator

In [3]: values = [1, 2, 7]

In [4]: columns = ['a', 'b', 'c']

In [5]: operators = [operator.gt, operator.ge, operator.le]  # >,  >=,  <=

Make a copy of df and iterate over the zipped items:

In [6]: selection = df.copy()

In [7]: for col, op, val in zip(columns, operators, values):
   ...:     selection = selection[op(selection[col], val)]
   ...:

In [8]: selection
Out[8]:
   a  b  c
2  6  2  6
3  2  6  3
5  5  9  0

Of course, if you don't know ahead of time how many columns there are, then it seems as though you also likely don't know the operators ahead of time either, which sort of defeats the purpose. This would become much easier if you only had to use a single operator but it looks like that's not the case, judging by your examples.

If your original post does indeed have typos (should all comparisons be >, or all of them >=?) and you actually intend to perform a single comparison operation, see this answer from @mozway.

Thank you for the input [ddejohn](https://stackoverflow.com/users/6298712/ddejohn), it allows to have other combination of operators — juanman, Sep 23 '21 at 14:02

mozway · Answer 3 · 2021-09-23T08:13:53.167

Here is another possibility to handle an arbitrary long list of comparisons (I am posting separately as the approach is quite different):

column_list = ['col1', 'col2', 'col3']
value_list = [3, 7, 11]
operator_list = [pd.Series.gt, pd.Series.ge, pd.Series.gt]

op_dic = dict(zip(column_list, operator_list))
val_dic = dict(zip(column_list, value_list))

df[df.apply(lambda c: op_dic[c.name](c, val_dic[c.name])).all(1)]

How it works:

Using apply, we perform a custom operation on all columns that will return a boolean per row, then take the rows that are True for all.

input:

   col1  col2  col3
0     0     1     2
1     3     4     5
2     6     7     8
3     9    10    11
4    12    13    14

output:

   col1  col2  col3
4    12    13    14

NB. to be more concise one can also do:

from operator import gt, ge
operator_list = [gt, ge, gt]

Dynamic comparison of values of n multiple Pandas columns

3 Answers3