How to apply a threshold to a pandas DataFrame column and output a row outside of the threshold?

Question

I have a large dataset of product families. I'm trying to catch any weird data entries where a price is too high/low than the other members of the family. For example, I have a this pandas.DataFrame:

df =
Prices    Product Family
0    1.99        Yoplait
1    1.89        Yoplait
2    1.59        Yoplait
3    1.99        Yoplait
4    7.99        Yoplait
5    12.99       Hunts 
6    12.99       Hunts 
7    2.99        Hunts 
8    12.49       Hunts

I want to write a for loop, that iterates through each Product Family, sets some kind of threshold which identifies which products are questionable (row 4 and row 7), and spits out that row. How can I do this?

So far I have this:

families = df['Product Family'].unique() 
for i in families: 
   if df['Prices] .....(set threshold)
   then.....(spit out that row that is questionable)

And then I would ideally finish off that if statement in the for loop, for each product family. Does anyone have an idea (or a better one) on how to set this threshold and finish off the code?

score 3 · Accepted Answer · answered Dec 28 '17 at 01:56

When using pandas it is best to not use loops if possible. In your case we can use groupby() to do operations of like Families. Here is one way to find your outliers by using values which differ from the group median:

Code:

df['median'] = df.groupby('Product_Family').transform('median')
df['outlier'] = ((df.Prices - df['median']) / df['median']).abs() > 0.5

Test Code:

import pandas as pd

df = pd.read_fwf(StringIO(u"""
    Prices      Product_Family
    1.99        Yoplait
    1.89        Yoplait
    1.59        Yoplait
    1.99        Yoplait
    7.99        Yoplait
    12.99       Hunts 
    12.99       Hunts 
    2.99        Hunts 
    12.49       Hunts"""),
                 skiprows=1)

df['median'] = df.groupby('Product_Family').transform('median')
df['outlier'] = ((df.Prices - df['median']) / df['median']).abs() > 0.5

print(df[df.outlier])    
print(df)

Results:

   Prices Product_Family  median  outlier
4    7.99        Yoplait    1.99     True
7    2.99          Hunts   12.74     True

   Prices Product_Family  median  outlier
0    1.99        Yoplait    1.99    False
1    1.89        Yoplait    1.99    False
2    1.59        Yoplait    1.99    False
3    1.99        Yoplait    1.99    False
4    7.99        Yoplait    1.99     True
5   12.99          Hunts   12.74    False
6   12.99          Hunts   12.74    False
7    2.99          Hunts   12.74     True
8   12.49          Hunts   12.74    False

This is great, makes sense! However I am getting an error on the line: — Hana, Dec 28 '17 at 04:29
df['median'] = df.groupby('Product_Family').transform('median') — Hana, Dec 28 '17 at 04:29
The error says: ValueError: Wrong number of items passed 2, placement implies 1 — Hana, Dec 28 '17 at 04:29
Under *Test Code*, I gave the exact to run. Start there, and then make changes, to see what breaks. — Stephen Rauch, Dec 28 '17 at 05:34

score 2 · Answer 2 · answered Dec 28 '17 at 02:15

Well I guess my way is similar to Stephen Rauch's. Only difference is that I standardize/normalize the prices per group.

# Standardize or normalize the `Prices` per `ProductFamily` (absolute value)
df_std = df.groupby('ProductFamily').transform(lambda x: np.abs((x - x.mean()) / x.std()))

# We assume that any Price beyond one standard deviation is an outlier
outlier_mask = df_std['Prices'] > 1.0

# Split clean and outlier dataframes
df_clean = df[~outlier_mask]
df_outlier = df[outlier_mask]

rnso · Answer 3 · 2017-12-28T05:26:24.677

One can also use quantiles for outlier detection with grouping and transformation as in other answers. Following uses 0.05 and 0.95 quantiles as limits:

# FIND LOWER AND UPPER LIMITS: 
df["lower"] = df.groupby("ProductFamily").transform(lambda x: x.quantile(0.05))
df["upper"] = df.iloc[:,0:2].groupby("ProductFamily").transform(lambda x: x.quantile(0.95))
print(df) 

# SELECT ROWS THAT MEET CRITERIA: 
df = df[(df.Prices > df.lower) & (df.Prices < df.upper)]
print(df)

# TO KEEP ORIGINAL 2 COLUMNS:
df = df.iloc[:,0:2]
print(df)

Output:

   Prices ProductFamily  lower  upper
0    1.99       Yoplait  1.650   6.79
1    1.89       Yoplait  1.650   6.79
2    1.59       Yoplait  1.650   6.79
3    1.99       Yoplait  1.650   6.79
4    7.99       Yoplait  1.650   6.79
5   12.99         Hunts  4.415  12.99
6   12.99         Hunts  4.415  12.99
7    2.99         Hunts  4.415  12.99
8   12.49         Hunts  4.415  12.99

   Prices ProductFamily  lower  upper
0    1.99       Yoplait  1.650   6.79
1    1.89       Yoplait  1.650   6.79
3    1.99       Yoplait  1.650   6.79
8   12.49         Hunts  4.415  12.99

   Prices ProductFamily
0    1.99       Yoplait
1    1.89       Yoplait
3    1.99       Yoplait
8   12.49         Hunts

How to apply a threshold to a pandas DataFrame column and output a row outside of the threshold?

3 Answers3

Code:

Test Code:

Results: