1

I would appreciate suggestions for a more computationally efficient way to dynamically filter a Pandas DataFrame.

The size of the DataFrame, len(df.index), is around 680,000.

This code from the callback function of a Plotly Dash dashboard is triggered when points on a scatter graph are selected. These points are passed to points as a list of dictionaries containing various properties with keys 'A' to 'C'. This allows the user to select a subset of the data in the pandas.DataFrame instance df for cross-filtering analysis.

rows_boolean = pandas.Series([False] * len(df.index))

for point in points:    
    current_condition = ((df['A'] == point['a']) & (df['B'] == point['b']) 
        & (df['C'] >= point['c']) & (df['C'] < point['d']))
    rows_boolean = rows_boolean | current_condition                      

filtered = df.loc[rows_boolean, list_of_column_names]

The body of this for loop is very slow as it is iterating over the whole data frame, it is manageable to run it once but not inside a loop.

Note that these filters are not additive, as in this example; each successive iteration of the for loop increases, rather than decreases, the size of filtered (as | rather than & operator is used).

Note also that I am aware of the existence of the method df['C'].between(point['c'], point['d']) as an alternative to the last two comparison operators, however, I only want this comparison to be inclusive at the lower end.

Solutions I have considered

Searching the many frustratingly similar posts on SO reveals a few ideas which get some of the way:

  • Using pandas.DataFrame.query() will require building a (potentially very large) query string as follows:
query = ' | '.join([f'((A == {point["a"]}) & (B == {point["b"]}) 
    & (C >= {point["c"]}) & (C < {point["d"]}))' for point in points])                

filtered = df.query(query)

My main concern here is that I don’t know how efficient the query method becomes when the query passed has several dozen (or even several hundred) conditions strung together. This solution also currently does not allow the selection of columns using list_of_column_names.

  • Another possible solution could come from implementing something like this.

To reiterate, speed is key here, so I'm not just after something that works, but something that works a darn sight faster than my boolean implementation above:

There should be one-- and preferably only one --obvious way to do it. (PEP 20)

JTanner
  • 64
  • 5
  • sounds like the bottleneck is your `for` loop, not accessing the rows. Have you done any profiling? – Paul H Aug 07 '20 at 23:01
  • As @JTanner has kindly clarified in the edit, the speed issue here is the code within the `for` loop. No, I haven't done any profiling - I'm very much a code optimisation novice! – Andrew Markham Aug 07 '20 at 23:44
  • 1
    Could you post a code example that runs and yield performance characteristics similar to your real world case? – emher Aug 08 '20 at 04:32

0 Answers0