
I have a very large pandas DataFrame with several thousand codes and the cost associated with each one of them (sample):

import pandas as pd

data = {'code': ['a', 'b', 'a', 'c', 'c', 'c', 'c'],
        'cost': [10, 20, 100, 10, 10, 500, 10]}
df = pd.DataFrame(data)

I am creating a groupby object at the code level, i.e.,:

grouped = df.groupby('code')['cost'].agg(['sum', 'mean']).apply(pd.Series)
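(As an aside, `agg` with a list of functions already returns a DataFrame, so the trailing `.apply(pd.Series)` appears redundant; a quick check on the sample data:)

```python
import pandas as pd

df = pd.DataFrame({'code': ['a', 'b', 'a', 'c', 'c', 'c', 'c'],
                   'cost': [10, 20, 100, 10, 10, 500, 10]})

# agg with a list of functions yields a DataFrame indexed by 'code',
# with one column per aggregation
grouped = df.groupby('code')['cost'].agg(['sum', 'mean'])
print(type(grouped))
print(grouped)
```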

Now I really need to add a new column to this grouped DataFrame, determining the percentage of codes that have outlier costs. My initial approach was this external function (using iqr from scipy):

import numpy as np
from scipy.stats import iqr

def is_outlier(s):
    # Only calculate outliers when we have at least 100 observations
    if s.count() >= 100:
        return np.where(s >= s.quantile(0.75) + 1.5 * iqr(s), 1, 0).mean()
    else:
        return np.nan

Having written this function, I added is_outlier to my agg arguments in the groupby above. This did not work, because I am trying to evaluate this is_outlier rate for every element in the cost series:

grouped = df.groupby('code')['cost'].agg(['sum', 'mean', is_outlier]).apply(pd.Series)

I attempted to use pd.Series.where, but it does not have the same functionality as np.where. Is there a way to modify my is_outlier function, which takes the cost series as its argument, so that it correctly evaluates the outlier rate for each code? Or am I completely off-path?

UPDATE Desired Result (minus the minimum observations requirement for this example):

>>> grouped

  code  sum   mean   is_outlier
0    a  110   55.0         0.50
1    b   20   20.0         0.00
2    c  530  132.5         0.25

Note: my sample is a poor one for calculating outliers, since I only have 2, 1, and 4 observations respectively for each code. In the production data frame each code has hundreds or thousands of observations, each one with a cost associated. In the sample result above, the is_outlier values mean that for 'a' one of the two observations has a cost in the outlier range, for 'c' one of the four observations does, and so on. I am trying to recreate this in my function by assigning 1's and 0's as the result of np.where() and taking the .mean() of that.

`.apply(pd.Series)` is needed in order to cast the `<pandas.core.groupby.SeriesGroupBy object>` resulting from `groupby` into a DataFrame. `s` is a pandas Series with all values of `cost` for each `code`, as generated from the `groupby` operation (the *split* phase of *split-apply-combine*).
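For what it's worth, in recent pandas versions a user-defined function passed to `agg` receives the whole per-group Series, so the computation described above can be sketched as follows (the IQR is computed directly from quantiles here, which with default settings matches `scipy.stats.iqr`; note that on this tiny sample the resulting rates differ from the desired table, since, as noted, the groups are too small for a meaningful IQR):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'code': ['a', 'b', 'a', 'c', 'c', 'c', 'c'],
                   'cost': [10, 20, 100, 10, 10, 500, 10]})

def is_outlier(s):
    # s is the per-group cost Series; flag costs at or above Q3 + 1.5*IQR
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return np.where(s >= q3 + 1.5 * (q3 - q1), 1, 0).mean()

# The function object itself can sit alongside the built-in aggregations
grouped = df.groupby('code')['cost'].agg(['sum', 'mean', is_outlier])
```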

nvergos
  • Please show the desired result and how you are running the *is_outlier* method; also, `apply(pd.Series)` is redundant. What is the *s* argument? The entire dataframe? – Parfait May 25 '18 at 18:41
  • @Parfait I just edited my question to explain – nvergos May 25 '18 at 18:58
  • Checking outlier across both *sum* and *mean* columns? – Parfait May 25 '18 at 19:47
  • No, checking whether each individual row's `cost` is an outlier with respect to this `code`'s cost distribution, and then coming up with a single outlier rate per `code`. For example, for `code 'c'` there are 4 observations: 10, 10, 500, 10. There is one outlier (500), so for `code 'c'` the `is_outlier` column has to have a value of `0.25` (1 of 4 observations) – nvergos May 25 '18 at 19:52

1 Answer


Data used

# Loading libraries
import pandas as pd
import numpy as np

# Creating Data set
data = {'code': ['a', 'b', 'a', 'c', 'c', 'c', 'c', 'a', 'a', 'a'],
        'cost': [10, 20, 200, 10, 10, 500, 10, 10, 10, 10]}

df = pd.DataFrame(data)

Defining a function for calculating the proportion of outliers in a specified column

def outlier_prop(df, name, group_by):

    """
    @Packages required
    import pandas as pd
    import numpy as np

    @input
    df = original dataframe
    name = column whose values are checked for outliers
    group_by = column to group by

    @output
    grouped dataframe with an 'outlier' column containing the proportion of outliers
    """

    # Step 1: Create a dict of values for each group
    value_dict = dict()
    for index, i in enumerate(df[group_by]):
        if i not in value_dict:
            value_dict[i] = [df[name][index]]
        else:
            value_dict[i].append(df[name][index])

    # Step 2: Calculate the outlier threshold (mean + 1.5*std) for each group
    outlier_thres_dict = dict()
    unique_groups = set(df[group_by])
    for i in unique_groups:
        outlier_thres_dict[i] = np.mean(value_dict[i]) + 1.5 * np.std(value_dict[i])

    # Step 3: Create a list flagging values greater than their group's threshold
    dummy_list = []
    for index, i in enumerate(df[group_by]):
        if df[name][index] > outlier_thres_dict[i]:
            dummy_list.append(1)
        else:
            dummy_list.append(0)

    # Step 4: Add the 0/1 flags to the original dataframe
    df['outlier'] = dummy_list

    # Step 5: Group and aggregate; the mean of the 0/1 'outlier' column
    # is the proportion of outliers per group
    grouped = df.groupby(group_by).agg(['sum', 'mean'])

    # Step 6: Return the grouped dataframe
    return grouped

Calling the function

outlier_prop(df, 'cost', 'code')

Output

https://raw.githubusercontent.com/magoavi/stackoverflow/master/50533570.png
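As an aside, the manual loops in Steps 1-3 can be collapsed with `groupby.transform`, which broadcasts each group's threshold back to its rows. A sketch using the same mean + 1.5*std threshold as the answer (`ddof=0` matches `np.std`'s default):

```python
import pandas as pd

data = {'code': ['a', 'b', 'a', 'c', 'c', 'c', 'c', 'a', 'a', 'a'],
        'cost': [10, 20, 200, 10, 10, 500, 10, 10, 10, 10]}
df = pd.DataFrame(data)

# Each group's threshold, aligned back to the original rows
thresh = df.groupby('code')['cost'].transform(lambda s: s.mean() + 1.5 * s.std(ddof=0))
df['outlier'] = (df['cost'] > thresh).astype(int)

# Proportion of outliers per code: mean of the 0/1 flags
prop = df.groupby('code')['outlier'].mean()
grouped = df.groupby('code')['cost'].agg(['sum', 'mean']).assign(outlier=prop)
```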

Avi Mago