Lazy way of selecting rows by multiple conditions

Question

Typical setup of columns having categorical values as following:

df = pd.DataFrame([('A', 'buy', 'sub'),
                  ('B', 'sell', 'prior'),
                  ('C', 'hold', 'sub'),
                  ('D', 'loan', 'none'),
                  ('A', 'hold', 'sub'),
                  ('A', 'buy', 'none')], columns=['name', 'action', 'class'])

    name     action   class
1    A          buy    sub
2    B          sell   prior    
3    C          hold   sub
4    D          loan   none
5    A          hold   sub
6    A          buy    none

Suppose I want to select rows with multiple conditions, from this thread, I can do this:

#selecting rows with name='A' AND action='buy'
df1 = df[(df['name']=='A') & (df['action']=='buy')]

#selecting rows with name='A' OR class='sub'
df2 = df[(df['name']=='A') | (df['class']=='sub')]

For a dataframe with more columns and more selecting conditions, that operation above gets to look ugly (i.e. long statements of & and |). I am looking to do something more compact, like this:

df1 = df[df[['name', 'action']] == ('A', 'buy')]
df2 = df[df[['name', 'class']] *** ('A', 'sub')]    *** is hybrid operator, maybe?

Any lazy ways to do this? If no such a way, feel free to suggest to close this question. Thanks.

score 3 · Accepted Answer · answered Aug 17 '21 at 22:03

You could use pandas query for this:

df.query('name == "A" and action=="buy"')

 name action class
0    A    buy   sub
5    A    buy  none

If you are certain of the columns that will be queried, you can shift them into the index and select via MultiIndexing:

df = df.set_index(['name', 'action'])
df = df.sort_index() # avoid performance issues 

df.loc(axis=0)['A', 'buy']
 
            class
name action      
A    buy      sub
     buy     none

score 2 · Answer 2 · answered Aug 17 '21 at 22:12

2

Here is a way to code like your first example of desired syntax:

df[df[['name', 'action']].apply(tuple, axis=1) == ('A', 'buy')]

Result:

  name action class
0    A    buy   sub
5    A    buy  none

answered Aug 17 '21 at 22:12

SeaBean

22,547
3
13
25

Thanks for the code. The ```pandas.query``` and ```MultiIndex``` seem to be able to solve my problem in a more generic way. I will probably explore that further. – Tristan Tran Aug 18 '21 at 16:39

score 2 · Answer 3 · answered Aug 17 '21 at 22:53

I can't think of a nice-syntax built-in way to do so, but here is a stab at a created function for doing so:

def select_and(df, cols, vals):
    res = pd.Series(True, dtype='bool', index=df.index)
    for col, val in zip(cols, vals):
        res = (res & (df[col] == val))
    return df[res]

You can then call with:

>>> select_and(df, cols=('name', 'action'), vals=('A', 'buy'))
  name action class
0    A    buy   sub
5    A    buy  none

You could write a corresponding select_or, or try to combine them into a larger more complicated function.

This is no longer using the getitem syntax (df[...]). You could instead have select_and return res instead of df[res], if you just wanted the boolean series for indexing. But then you would have to call like:

df[select_and(df, ('name', 'action'), ('A', 'buy'))]

Which feels less clean to me.

score 1 · Answer 4 · answered Aug 17 '21 at 22:33

You can make the column references less fiddly to type by changing for example:

df1 = df[(df['name']=='A') & (df['action']=='buy')]

To:

df1 = df[(df.name =='A') & (df.action =='buy')]

However, if you do have a column in your data called 'class' you cannot do that as it'll error. So this is invalid:

df2 = df[(df.name =='A') | (df.class == 'sub')]

Example:

    df2 = df[(df.name =='A') | (df.class == 'sub')]
                                       ^
SyntaxError: invalid syntax

Change the column name (if possible) and you'll have no problem.

Thanks for the advice. I did not realize ```class``` is a keyword. — Tristan Tran, Aug 18 '21 at 16:43

score 1 · Answer 5 · answered Aug 17 '21 at 23:28

Lazy way of selecting rows by multiple conditions

I guess for a truly lazy way, and if you are using Jupyter, you could consider qgrid

An Interactive Grid for Sorting and Filtering DataFrames in Jupyter Notebook

Code:

# if required (terminal commands):
# pip install qgrid
# pip install jupyter_contrib_nbextensions && jupyter contrib nbextension install 
# jupyter nbextension enable --py --sys-prefix qgrid

import pandas as pd
import qgrid

df = pd.DataFrame([('A', 'buy', 'sub'),
                  ('B', 'sell', 'prior'),
                  ('C', 'hold', 'sub'),
                  ('D', 'loan', 'none'),
                  ('A', 'hold', 'sub'),
                  ('A', 'buy', 'none')], columns=['name', 'action', 'class'])

qgrid_widget = qgrid.show_grid(df, show_toolbar=True)
qgrid_widget

Screenshot:

Thanks for the suggestion. This is new and helpful info with Jupyter for me. — Tristan Tran, Aug 18 '21 at 16:40

Lazy way of selecting rows by multiple conditions

5 Answers5