Suppose I have a groupby object, a DataFrame, or anything else with an apply() method, and I want some elements not to map to any output. In my case I have a groupby, and I want groups that satisfy a certain criterion to be ignored. How can I do that? I've tried returning None from the function being applied, but the output still has an entry for the group (it's null, but it's still there).

For example, suppose a DataFrame looks like this:

good_row            272.0  42440.0  29893408.0
good_row_2          142.0  22360.0  12965953.0
bad_row             171.0  26920.0  14726556.0

I want to run df.apply(fn, axis=1) such that fn returns some output for the good rows but tells apply to "ignore" the bad row, so that the output has no entry for bad_row. I used a DataFrame here rather than a groupby for ease of demonstration, but it's the same idea.

Bluefire

2 Answers

You could return pd.Series(index=['output_column1', 'output_column2', ...]) instead of None and then remove rows that are all NaN values like so:

cleaned_output_df = output_df.dropna(axis=0, how='all')
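As a minimal sketch of this approach, the snippet below uses the three rows from the question. The column names a, b, c, the output columns ratio and total, the function fn, and the 0.0018 threshold are all assumptions chosen for illustration; substitute your own criterion and outputs.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the column names a, b, c are assumed.
df = pd.DataFrame(
    {
        "a": [272.0, 142.0, 171.0],
        "b": [42440.0, 22360.0, 26920.0],
        "c": [29893408.0, 12965953.0, 14726556.0],
    },
    index=["good_row", "good_row_2", "bad_row"],
)

# Hypothetical output columns, shared by the "good" and "ignored" branches.
OUTPUT_COLUMNS = ["ratio", "total"]


def fn(row):
    """Return real values for good rows and an all-NaN Series for bad rows."""
    ratio = row["b"] / row["c"]
    if ratio > 0.0018:  # assumed criterion for a "bad" row
        return pd.Series(np.nan, index=OUTPUT_COLUMNS)
    return pd.Series([ratio, row["a"] + row["b"]], index=OUTPUT_COLUMNS)


# bad_row is still present here, but every value in it is NaN.
output_df = df.apply(fn, axis=1)

# Dropping rows that are entirely NaN removes the placeholder.
cleaned_output_df = output_df.dropna(axis=0, how="all")
print(cleaned_output_df)
```

Note that how="all" would also drop a good row whose outputs all happen to be NaN, so this only works if NaN is not a legitimate result for every output column at once.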

Alternatively, if you know in advance which rows you don't want to apply your function to, you can filter them out before calling apply:

df.loc[boolean_array].apply(your_function_goes_here)

or

df.query("column_a > 15").apply(your_function_goes_here)
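For a concrete picture, here is a small sketch of both filtering styles on the question's data. The column names, the row_total function, and the 0.0018 threshold are assumptions made up for the example, not part of the original problem.

```python
import pandas as pd

# Sample data from the question; the column names a, b, c are assumed.
df = pd.DataFrame(
    {
        "a": [272.0, 142.0, 171.0],
        "b": [42440.0, 22360.0, 26920.0],
        "c": [29893408.0, 12965953.0, 14726556.0],
    },
    index=["good_row", "good_row_2", "bad_row"],
)


def row_total(row):
    """Hypothetical per-row function: sum of the three columns."""
    return row["a"] + row["b"] + row["c"]


# Boolean mask: True for rows that should be kept (assumed criterion).
keep = df["b"] / df["c"] <= 0.0018

# The function never sees bad_row, so the output has no entry for it.
via_loc = df.loc[keep].apply(row_total, axis=1)

# The same filter written as a query string.
via_query = df.query("b / c <= 0.0018").apply(row_total, axis=1)

print(via_loc)
```

Both variants should give a Series indexed only by good_row and good_row_2.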

You can also filter groupby objects using their filter method; see the docs for an example. The syntax looks like this:

grouped = df.groupby('column_A')
filtered = grouped.filter(some_function_that_takes_a_df_and_returns_a_bool)
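A short sketch of how this might look end to end, using made-up data: the value column, the keep_group predicate, and the sum-greater-than-10 rule are hypothetical. The filter method returns the rows of the groups for which the predicate is True, so you can group again and aggregate only the surviving groups.

```python
import pandas as pd

# Made-up data: three groups in column_A with a hypothetical value column.
df = pd.DataFrame(
    {
        "column_A": ["x", "x", "y", "y", "z"],
        "value": [1, 2, 30, 40, 5],
    }
)


def keep_group(group):
    """Hypothetical predicate: keep groups whose values sum to more than 10."""
    return group["value"].sum() > 10


grouped = df.groupby("column_A")

# Rows belonging to groups x and z are dropped; only the y rows remain.
filtered = grouped.filter(keep_group)

# Re-group the surviving rows and aggregate; ignored groups have no entry.
result = filtered.groupby("column_A")["value"].sum()
print(result)
```

Here result contains a single entry for group y, and the ignored groups do not appear at all, not even as nulls.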
tobsecret

Filter your dataframe first, then apply your function to the filtered results.

Let's say that the criterion that differentiates good rows from bad rows is that the ratio of the second column to the third column is less than or equal to 0.0018. Let's also say that you want to square all the values in all cells (that meet the criterion). You could use the following code:

import pandas as pd
df = pd.DataFrame(data=[
        {'a': 272., 'b': 42440., 'c': 29893408.},
        {'a': 142., 'b': 22360., 'c': 12965953.},
        {'a': 171., 'b': 26920., 'c': 14726556.}
    ], index=[
        'good_row',
        'good_row_2',
        'bad_row'
    ])

# One line, operator chaining
df[df['b'] / df['c'] <= 0.0018].apply(pow, args=(2,), axis=1)

# Three lines with intermediate objects
good_row_index = df['b'] / df['c'] <= 0.0018
filtered_df = df[good_row_index]
filtered_df.apply(pow, args=(2,), axis=1)
bsterrett