3

In Python, a common way to filter a data frame is like this

df.loc[(df['field 1'] == 'a') & (df['field 2'] == 'b'), 'field 3']...

When df name is long, or when there are more filter conditions (only two in the above), the above line will be long naturally. Moreover, it is a bit tedious to have type out df name for each condition. In R or SQL, we don't really need to do that. So, my question is if there is a way to shorten the above line in Python. For example, is there a way that I don't have to write down df name in each condition? Thanks.

LaTeXFan
  • 1,136
  • 4
  • 14
  • 36

3 Answers3

3

We can write a function that accepts lists of field names and conditions, iterates in parallel to create masks for each name-condition pair in a generator expression, combines those masks with numpy.logical_and (since we want to replicate &; it would be logical_or for |) - skipping a redundant pair of parentheses due to a special syntax rule - and applies the result:

def matching(df, fields, values):
    return df.loc[np.logical_and(
        df[field] == value
        for field, value in zip(fields, values)
    )]

Which should then allow us to simplify the code:

matching(df, ['field 1', 'field 2', 'field 3'], ['a', 'b', 'c'])

Alternately, we could specify the input as field-value pairs (and skip the zip step), as keys and values of a dictionary (iterating over its .items()), etc. The generalization of arbitrarily many &s using numpy.logical_and is the key insight here; the rest is just software engineering.

Karl Knechtel
  • 62,466
  • 11
  • 102
  • 153
3

you can use the query method as follows which is has an SQL like syntax

df.query('field_1 == "a" and field_2 == "b" and field_3 > 3.2 ')

Here are the docs for it https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html

  • What about adding links to your answer, e.g. docs for [`DataFrame.query()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) ? – hc_dev Dec 27 '22 at 09:46
2

You can use dataframe.eval() function. It can be done like below.

df[df.eval("(field 1 == 'a') & (field 2 == 'b')")]
Julia Meshcheryakova
  • 3,162
  • 3
  • 22
  • 42