235

I have a scenario where a user wants to apply several filters to a Pandas DataFrame or Series object. Essentially, I want to efficiently chain a bunch of filtering (comparison operations) together that are specified at run-time by the user.

  • The filters should be additive (i.e., each one applied should narrow the results).
  • I'm currently using reindex() (as below) but this creates a new object each time and copies the underlying data (if I understand the documentation correctly). I want to avoid this unnecessary copying as it will be really inefficient when filtering a big Series or DataFrame.
  • I'm thinking that using apply(), map(), or something similar might be better. I'm pretty new to Pandas though so still trying to wrap my head around everything.
  • Also, I would like to expand this so that the dictionary passed in can include the columns to operate on and filter an entire DataFrame based on the input dictionary. However, I'm assuming whatever works for a Series can be easily expanded to a DataFrame.

TL;DR

I want to take a dictionary of the following form, apply each operation to a given Series object, and return a 'filtered' Series object.

relops = {'>=': [1], '<=': [1]}

Long Example

I'll start with an example of what I currently have, just filtering a single Series object. Below is the function I'm currently using:

   def apply_relops(series, relops):
        """
        Pass dictionary of relational operators to perform on given series object
        """
        for op, vals in relops.iteritems():
            # look up the comparison function, e.g. '>=' -> operator.ge
            op_func = ops[op]
            for val in vals:
                # boolean mask of the elements that pass this comparison
                filtered = op_func(series, val)
                # narrow to the surviving labels (this copies the data)
                series = series.reindex(series[filtered].index)
        return series

The user provides a dictionary with the operations they want to perform:

>>> df = pandas.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
>>> print df
   col1  col2
0     0    10
1     1    11
2     2    12

>>> from operator import le, ge
>>> ops = {'>=': ge, '<=': le}
>>> apply_relops(df['col1'], {'>=': [1]})
col1
1       1
2       2
Name: col1
>>> apply_relops(df['col1'], relops={'>=': [1], '<=': [1]})
col1
1       1
Name: col1

Again, the 'problem' with my approach above is that I think it does a lot of possibly unnecessary copying of the data in the intermediate steps.

durden2.0
  • Also, I'm fully aware that this approach to the problem might be way off. So maybe rethinking the entire approach would be useful. I just want to allow users to specify a set of filter operations at runtime and execute them. – durden2.0 Nov 28 '12 at 17:35
  • I'm wondering if pandas can do similar things as data.table in R: df[col1<1,,][col2>=1] – xappppp Jun 04 '18 at 13:58
  • `df.query` and `pd.eval` seem like good fits for your use case. For information on the `pd.eval()` family of functions, their features and use cases, please visit [Dynamic Expression Evaluation in pandas using pd.eval()](https://stackoverflow.com/questions/53779986/dynamic-expression-evaluation-in-pandas-using-pd-eval). – cs95 Dec 16 '18 at 04:53

7 Answers

359

Pandas (and numpy) allow for boolean indexing, which will be much more efficient:

In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]: 
1    1
2    2
Name: col1

In [12]: df[df['col1'] >= 1]
Out[12]: 
   col1  col2
1     1    11
2     2    12

In [13]: df[(df['col1'] >= 1) & (df['col1'] <= 1)]
Out[13]: 
   col1  col2
1     1    11

If you want to write helper functions for this, consider something along these lines:

In [14]: def b(x, col, op, n): 
             return op(x[col], n)

In [15]: def f(x, *b):
             return x[np.logical_and(*b)]

In [16]: b1 = b(df, 'col1', ge, 1)

In [17]: b2 = b(df, 'col1', le, 1)

In [18]: f(df, b1, b2)
Out[18]: 
   col1  col2
1     1    11

Update: pandas 0.13 has a query method for this kind of use case. Assuming column names are valid identifiers, the following works (and can be more efficient for large frames, since it uses numexpr behind the scenes):

In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
   col1  col2
1     1    11
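
Since the query argument is just a string, it can also be assembled at run-time. A minimal sketch of that idea (build_query is a hypothetical helper, and the relops dict format is taken from the question):

def build_query(col, relops):
    # e.g. {'>=': [1], '<=': [1]} -> "col1 >= 1 and col1 <= 1"
    clauses = ['{} {} {}'.format(col, op, val)
               for op, vals in relops.items()
               for val in vals]
    return ' and '.join(clauses)

df.query(build_query('col1', {'>=': [1], '<=': [1]}))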
Andy Hayden
  • You're right, boolean indexing is more efficient since it doesn't make a copy of the data. However, my scenario is a bit trickier than your example. The input I receive is a dictionary defining which filters to apply. My example could do something like `df[ge(df['col1'], 1) & le(df['col1'], 1)]`. The issue for me really is that the dictionary of filters could contain lots of operators, and chaining them together is cumbersome. Maybe I could add each intermediate boolean array to a big array and then just use `map` to apply the `and` operator to them? – durden2.0 Nov 29 '12 at 03:56
  • @durden2.0 I've added an idea for a helper function, which I think is similar to what you are looking for :) – Andy Hayden Nov 29 '12 at 09:45
  • That looks very close to what I came up with! Thanks for the example. Why does `f()` need to take `*b` instead of just `b`? Is this so a user of `f()` could still use the optional `out` parameter to `logical_and()`? This leads to another small side question. What is the performance benefit/trade-off of passing in an array via `out` vs. using the one returned from `logical_and()`? Thanks again! – durden2.0 Nov 29 '12 at 14:30
  • Never mind, I didn't look closely enough. The `*b` is necessary because you are passing the two arrays `b1` and `b2` and you need to unpack them when calling `logical_and`. However, the other question still stands. Is there a performance benefit to passing in an array via the `out` parameter to `logical_and()` vs. just using its return value? – durden2.0 Nov 29 '12 at 14:55
  • No, sorry I was lazy and didn't try it myself. I was just curious of any theoretical information behind it. – durden2.0 Nov 29 '12 at 17:03
  • this 'helper function' method seems to have greater potential to be used dynamically (e.g. allow for checking whether or not a filter was passed) vs. the query string which is static and hard-coded. However, this answer was originally posted in 2012, so I am wondering if it is still relevant now or if there are easier ways available? – user5359531 Feb 16 '17 at 02:13
  • @user5359531 the string is just a string and so can be created dynamically and passed to the query method. The update (of the query method) is from 2014 and still valid! – Andy Hayden Feb 16 '17 at 04:45
  • Great answer, which led me to what I needed. Small note - seeing `*b`, I thought maybe I could pass multiple (ie more than 2) functions through, but `np.logical_and` only takes 2 arguments. My quick fix was to change `f` to something like (excuse the terrible comment-formatting): `def f(x, *bs): base = bs[0]; for b in bs[1:]: base = np.logical_and(base, b); return x[(base)]` - now you can pass in an arbitrary number of filters (even just 1 - it won't ever call `np.logical_and` in that case!) – dwanderson Feb 27 '17 at 19:23
  • I don't think columns that are not valid identifiers can be used with the query method: https://github.com/pandas-dev/pandas/issues/6508 – Emil Gotsev May 23 '18 at 09:52
  • @EmilGotsev ah, yes, it does look like that's no longer supported. You used to be able to have access to all variables iirc. – Andy Hayden May 23 '18 at 19:12
  • Hey, what about the or filter? – Saurabh Yadav Dec 03 '18 at 12:39
  • @dwanderson you can pass a list of conditions to `np.logical_and.reduce` for multiple conditions. Example: `np.logical_and.reduce([df['a'] == 3, df['b'] > 10, df['c'].isin([1, 3, 5])])` – Kuzenbo Apr 28 '19 at 07:45
62

Chaining conditions creates long lines, which PEP 8 discourages. Using the .query method forces you to use strings, which is powerful but unpythonic and not very dynamic.

Once each of the filters is in place, one approach could be:

import numpy as np
import functools
def conjunction(*conditions):
    return functools.reduce(np.logical_and, conditions)

c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4

data_filtered = data[conjunction(c_1,c_2,c_3)]

np.logical_and operates elementwise and is fast, but it does not take more than two arguments, which is handled by functools.reduce.

Note that this still has some redundancies:

  • Short-circuiting does not happen on a global level
  • Each of the individual conditions runs on the whole initial data

Still, I expect this to be efficient enough for many applications, and it is very readable. You can also make a disjunction (wherein only one of the conditions needs to be true) by using np.logical_or instead:

import numpy as np
import functools
def disjunction(*conditions):
    return functools.reduce(np.logical_or, conditions)

c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4

data_filtered = data[disjunction(c_1,c_2,c_3)]
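
Because conjunction takes *conditions, the condition list can also be built at run-time and unpacked, for example from the question's relops-style dict. A rough sketch under that assumption (the ops mapping mirrors the one in the question):

import operator

ops = {'>=': operator.ge, '<=': operator.le}
relops = {'>=': [1], '<=': [1]}  # as in the question
conditions = [ops[op](data.col1, val)
              for op, vals in relops.items()
              for val in vals]
data_filtered = data[conjunction(*conditions)]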
Gecko
  • Is there a way to implement this for a variable number of conditions? I have tried appending each `c_1`, `c_2`, `c_3`, ... `c_n` to a list and then passing `data[conjunction(conditions_list)]`, but I get an error `ValueError: Item wrong length 5 instead of 37.` I also tried `data[conjunction(*conditions_list)]`, but I get a different result than `data[conjunction(c_1, c_2, c_3, ... c_n)]`; not sure what is going on. – user5359531 Feb 16 '17 at 03:04
  • Found a solution to the error elsewhere. `data[conjunction(*conditions_list)]` does work after packing the dataframes into a list, and [unpacking the list in place](https://docs.python.org/2/tutorial/controlflow.html#unpacking-argument-lists) – user5359531 Feb 16 '17 at 03:21
  • I just left a comment on the above answer with a much sloppier version, and then noticed your answer. Very clean, I like it a lot! – dwanderson Feb 27 '17 at 22:09
  • This is a great answer! – Charlie Crown Jul 16 '19 at 19:59
  • I'd use: `df[f_2 & f_3 & f_4 & f_5]` with `f_2 = df["a"] >= 0`, etc. No need for that function... (nice use of a higher-order function though...) – A. Rabus Aug 29 '19 at 08:29
  • Not to nitpick, but just to confirm: you need `c_1`, `c_2`, and `c_3` in your disjunction function above, not `c1`, `c2`, and `c3`? – vagabond May 05 '21 at 03:58
45

Simplest of All Solutions:

Use:

filtered_df = df[(df['col1'] >= 1) & (df['col1'] <= 5)]

Another example: to filter the DataFrame for values belonging to Feb-2018, use the code below.

filtered_df = df[(df['year'] == 2018) & (df['month'] == 2)]
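
If the chained expression grows long, each mask can be assigned to a variable first and the variables combined afterwards; a small readability sketch using the same hypothetical year/month columns:

is_2018 = df['year'] == 2018
is_feb = df['month'] == 2
filtered_df = df[is_2018 & is_feb]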
Gil Baggio
  • I am using a variable instead of a constant and getting an error. `df[df[ ]][df[ ]]` gives a warning message but gives the correct answer. – Nguai al Jan 16 '19 at 05:43
14

Since the pandas 0.22 update, comparison methods are available, like:

  • gt (greater than)
  • lt (less than)
  • eq (equal to)
  • ne (not equal to)
  • ge (greater than or equal to)

and many more. These functions return a boolean Series. Let's see how we can use them:

# sample data
df = pd.DataFrame({'col1': [0, 1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14, 15]})

# get values from col1 greater than or equals to 1
df.loc[df['col1'].ge(1),'col1']

1    1
2    2
3    3
4    4
5    5

# where col1 value is between 0 and 2
df.loc[df['col1'].between(0,2)]

 col1 col2
0   0   10
1   1   11
2   2   12

# where col1 > 1
df.loc[df['col1'].gt(1)]

 col1 col2
2   2   12
3   3   13
4   4   14
5   5   15
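
Because these are ordinary method names, they can be looked up with getattr, which makes it easy to drive the filtering from the question's relops-style dict. A sketch under that assumption (the '>=' -> 'ge' mapping and this rewritten apply_relops are illustrative, not part of pandas itself):

method_for = {'>': 'gt', '<': 'lt', '==': 'eq', '!=': 'ne', '>=': 'ge', '<=': 'le'}

def apply_relops(series, relops):
    # start from an all-True mask and narrow it with each comparison
    mask = pd.Series(True, index=series.index)
    for op, vals in relops.items():
        for val in vals:
            mask &= getattr(series, method_for[op])(val)
    return series[mask]

apply_relops(df['col1'], {'>=': [1], '<=': [1]})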
YOLO
  • I think this is the cleanest and most efficient option, and you can also assign these conditions, e.g. `c0 = df.loc[:, "area"].eq("TNG")`, and use them for slicing, e.g. `df.loc[c0 & c1, ...]`. – Cord Kaldemeyer Jun 03 '23 at 07:21
5

Why not do this?

def filt_spec(df, col, val, op):
    import operator
    ops = {'eq': operator.eq, 'neq': operator.ne,
           'gt': operator.gt, 'ge': operator.ge,
           'lt': operator.lt, 'le': operator.le}
    return df[ops[op](df[col], val)]
pd.DataFrame.filt_spec = filt_spec

Demo:

df = pd.DataFrame({'a': [1,2,3,4,5], 'b':[5,4,3,2,1]})
df.filt_spec('a', 2, 'ge')

Result:

   a  b
 1  2  4
 2  3  3
 3  4  2
 4  5  1

You can see that column 'a' has been filtered where a >=2.

This is slightly faster to type (not faster to run) than chaining operators. You could of course put the import at the top of the file.
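
A hypothetical extension in the spirit of the question's run-time filters: apply several (col, op, val) triples in sequence, each narrowing the previous result (the filters list here is illustrative):

filters = [('a', 'ge', 2), ('b', 'le', 4)]
out = df
for col, op, val in filters:
    out = filt_spec(out, col, val, op)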

Obol
4

We can also select rows based on values of a column that are not in a list (or any iterable). We build a boolean mask just like before, but negate it by placing ~ in front.

For example:

values = [1, 0]
df[~df.col1.isin(values)]
Ram Prajapati
4

If you want to check any/all of multiple columns for a value, you can do:

df[(df[['HomeTeam', 'AwayTeam']] == 'Fulham').any(axis=1)]
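
The .all(axis=1) counterpart keeps only rows where every listed column matches; a short sketch with the same hypothetical columns:

# rows where both HomeTeam and AwayTeam are 'Fulham'
df[(df[['HomeTeam', 'AwayTeam']] == 'Fulham').all(axis=1)]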
wonderkid2