Here is another approach. It's cleaner, more performant, and has the advantage that columns
can be empty (in which case the entire data frame is returned).
def filter(df, value, *columns):
    # columns arrives as a tuple; convert to a list so .loc treats it as a column subset
    return df.loc[df.loc[:, list(columns)].eq(value).all(axis=1)]
Explanation
values = df.loc[:, list(columns)]
selects only the columns we are interested in.
masks = values.eq(value)
gives a boolean data frame indicating equality with the target value.
mask = masks.all(axis=1)
reduces across the columns with a logical AND, yielding a boolean row mask (a Series). Note that you can use masks.any(axis=1)
for an OR instead.
return df.loc[mask]
applies the boolean mask to the data frame, keeping only the matching rows.
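To make these intermediate steps concrete, here is a minimal walkthrough on a small made-up frame (the data is purely illustrative):
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 0, 1],
                   'B': [1, 0, 0, 1],
                   'C': [0, 1, 1, 1]})
values = df.loc[:, ['A', 'B']]  # only the columns of interest
masks = values.eq(1)            # True where a cell equals the target value
mask = masks.all(axis=1)        # True only where every selected column matched
print(df.loc[mask])             # rows 0 and 3, where both A and B equal 1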
Demo
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 2, (100, 3)), columns=list('ABC'))
# both columns
assert np.all(filter(df, 1, 'A', 'B') == df[(df.A == 1) & (df.B == 1)])
# no columns
assert np.all(filter(df, 1) == df)
# different values per column
assert np.all(filter(df, [1, 0], 'A', 'B') == df[(df.A == 1) & (df.B == 0)])
Alternative
For a small number of columns (< 5), the following solution, based on steven's answer, is more performant than the above, although less flexible. As-is, it will not work with an empty columns
set, nor with different values per column (a variant that handles per-column values is sketched below).
from functools import reduce
from operator import and_

def filter(df, value, *columns):
    return df.loc[reduce(and_, (df[column] == value for column in columns))]
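If you do want different values per column in this style, one possible extension (a sketch, not part of steven's answer; filter_pairs is a hypothetical name) pairs each column with its own target:
from functools import reduce
from operator import and_

def filter_pairs(df, pairs):
    # pairs is an iterable of (column, value) tuples, e.g. [('A', 1), ('B', 0)]
    # like the original, this raises on an empty pairs iterable
    return df.loc[reduce(and_, (df[column] == value for column, value in pairs))]
Here filter_pairs(df, [('A', 1), ('B', 0)]) would match filter(df, [1, 0], 'A', 'B') from the demo above.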
Retrieving a Series object by key (df[column]) is significantly faster than constructing a DataFrame object around a subset of columns (df.loc[:, columns]).
In [4]: %timeit df['A'] == 1
100 loops, best of 3: 17.3 ms per loop
In [5]: %timeit df.loc[:, ['A']] == 1
10 loops, best of 3: 48.6 ms per loop
Nevertheless, this speedup becomes negligible when dealing with a larger number of columns. The bottleneck becomes ANDing the masks together, for which reduce(and_, ...) is far slower than the pandas builtin all(axis=1).
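If you want to check where that crossover happens on your own data, here is a rough benchmark sketch (the frame shape is arbitrary, and timings will vary by machine and pandas version):
import numpy as np
import pandas as pd
from functools import reduce
from operator import and_
from timeit import timeit

df = pd.DataFrame(np.random.randint(0, 2, (100000, 10)), columns=list('ABCDEFGHIJ'))
columns = list('ABCDEFGHIJ')
# chained single-column masks, combined pairwise by reduce
print(timeit(lambda: df.loc[reduce(and_, (df[c] == 1 for c in columns))], number=10))
# vectorised equality plus all(axis=1) over the same columns
print(timeit(lambda: df.loc[df.loc[:, columns].eq(1).all(axis=1)], number=10))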