
If you know exactly how you want to filter a dataframe, the solution is trivial:

df[(df.A == 1) & (df.B == 1)]

But what if you are accepting user input and do not know beforehand how many criteria the user wants to use? For example, the user wants a filtered data frame where columns [A, B, C] == 1. Is it possible to do something like:

def filterIt(*args, value):
    return df[(df.*args == value)]

so if the user calls filterIt(A, B, C, value=1), it returns:

df[(df.A == 1) & (df.B == 1) & (df.C == 1)]
yobogoya
  • Possible dupe: https://stackoverflow.com/questions/11869910/pandas-filter-rows-of-dataframe-with-operator-chaining – chishaku Feb 09 '16 at 22:43
  • https://stackoverflow.com/questions/13611065/efficient-way-to-apply-multiple-filters-to-pandas-dataframe-or-series – chishaku Feb 09 '16 at 22:43
  • Do you always want to compare the different columns with the same value? (1 in this case?) – joris Feb 09 '16 at 23:03

4 Answers


I think the most elegant way to do this is using df.query(), where you can build up a string with all your conditions, e.g.:

import pandas as pd
import numpy as np

cols = {}
for col in ('A', 'B', 'C', 'D', 'E'):
    cols[col] = np.random.randint(1, 5, 20)
df = pd.DataFrame(cols)

def filter_df(df, filter_cols, value):
    conditions = []
    for col in filter_cols:
        conditions.append('{c} == {v}'.format(c=col, v=value))
    query_expr = ' and '.join(conditions)
    print('querying with: {q}'.format(q=query_expr))
    return df.query(query_expr)

Example output (your results may differ due to the randomly generated data):

filter_df(df, ['A', 'B'], 1)
querying with: A == 1 and B == 1
    A  B  C  D  E
6   1  1  1  2  1
11  1  1  2  3  4
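One caveat worth adding (my note, not part of the original answer): interpolating the value directly into the string only works for numbers. If the target value is a string, it needs quoting inside the query expression; `repr()` handles both cases:

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'x'], 'B': ['x', 'x', 'x']})

def filter_df_str(df, filter_cols, value):
    # repr() quotes strings ('x' -> "'x'") and leaves numbers bare (1 -> '1'),
    # so the same query builder works for both value types
    query_expr = ' and '.join('{c} == {v}'.format(c=c, v=repr(value))
                              for c in filter_cols)
    return df.query(query_expr)

print(filter_df_str(df, ['A', 'B'], 'x'))  # rows 0 and 2 match
```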
Marius

Here is another approach. It's cleaner, more performant, and has the advantage that columns can be empty (in which case the entire data frame is returned).

def filter(df, value, *columns):
    return df.loc[df.loc[:, columns].eq(value).all(axis=1)]

Explanation

  1. values = df.loc[:, columns] selects only the columns we are interested in.
  2. masks = values.eq(value) gives a boolean data frame indicating equality with the target value.
  3. mask = masks.all(axis=1) applies an AND across columns (returning an index mask). Note that you can use masks.any(axis=1) for an OR.
  4. return df.loc[mask] applies index mask to the data frame.
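The four steps above can be spelled out with intermediate variables (toy data of my own, for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1], 'B': [1, 1, 0]})
columns, value = ['A', 'B'], 1

values = df.loc[:, columns]   # 1. select only the columns of interest
masks = values.eq(value)      # 2. boolean frame: per-cell equality with value
mask = masks.all(axis=1)      # 3. AND across columns -> boolean Series
print(df.loc[mask])           # 4. row 0 is the only all-ones row
```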

Demo

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, (100, 3)), columns=list('ABC'))

# both columns
assert np.all(filter(df, 1, 'A', 'B') == df[(df.A == 1) & (df.B == 1)])

# no columns
assert np.all(filter(df, 1) == df)

# different values per column
assert np.all(filter(df, [1, 0], 'A', 'B') == df[(df.A == 1) & (df.B == 0)])

Alternative

For a small number of columns (< 5), the following solution, based on steven's answer, is more performant than the above, although less flexible. As-is, it will not work for an empty columns set, and will not work using different values per column.

from functools import reduce  # reduce is no longer a builtin on Python 3
from operator import and_

def filter(df, value, *columns):
    return df.loc[reduce(and_, (df[column] == value for column in columns))]

Retrieving a Series object by key (df[column]) is significantly faster than constructing a DataFrame object around a subset of columns (df.loc[:, columns]).

In [4]: %timeit df['A'] == 1
100 loops, best of 3: 17.3 ms per loop

In [5]: %timeit df.loc[:, ['A']] == 1
10 loops, best of 3: 48.6 ms per loop

Nevertheless, this speedup becomes negligible when dealing with a larger number of columns. The bottleneck becomes ANDing the masks together, for which reduce(and_, ...) is far slower than the Pandas builtin all(axis=1).

Igor Raush

This is pretty messy but it seems to work.

from functools import reduce  # needed on Python 3
import operator

def filterIt(value, args):
    # b is the dataframe defined below; it is looked up at call time
    stuff = [getattr(b, thing) == value for thing in args]
    return reduce(operator.and_, stuff)

a = {'A':[1,2,3],'B':[2,2,2],'C':[3,2,1]}
b = pd.DataFrame(a)
filterIt(2,['A','B','C'])

0    False
1     True
2    False
dtype: bool



This matches the hand-written chain:

(b.A == 2) & (b.B == 2) & (b.C == 2)

0    False
1     True
2    False
dtype: bool
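Note that filterIt returns only the boolean mask; to get the filtered rows, index the frame with it. A self-contained sketch of the same code, with the imports it needs:

```python
from functools import reduce
import operator
import pandas as pd

b = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 2, 2], 'C': [3, 2, 1]})

def filterIt(value, args):
    # one boolean Series per column, ANDed together with reduce
    stuff = [getattr(b, thing) == value for thing in args]
    return reduce(operator.and_, stuff)

print(b[filterIt(2, ['A', 'B', 'C'])])  # only row 1 has 2 in all three columns
```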
steven

Thanks for the help, guys. I came up with something similar to Marius's answer after finding out about df.query():

def makeQuery(cols, equivalence=True, *args):
    operator = ' == ' if equivalence else ' != '
    query = ''
    for arg in args:
        for col in cols:
            query += "({}{}{})".format(col, operator, arg) + ' & '

    return query[:-3]


query = makeQuery(['A', 'B', 'C'], False, 1, 2)

The contents of query is a string:

(A != 1) & (B != 1) & (C != 1) & (A != 2) & (B != 2) & (C != 2) 

that can be passed to df.query(query)
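For completeness, a self-contained round trip on a toy frame (my example data, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 2], 'C': [1, 1, 1]})

def makeQuery(cols, equivalence=True, *args):
    operator = ' == ' if equivalence else ' != '
    query = ''
    for arg in args:
        for col in cols:
            query += "({}{}{})".format(col, operator, arg) + ' & '
    return query[:-3]  # drop the trailing ' & '

# exclude rows where any of A, B, C equals 2 or 3
query = makeQuery(['A', 'B', 'C'], False, 2, 3)
print(df.query(query))  # only row 0 survives
```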

yobogoya