1

I am new to python programming. I am trying to determine outliers in my data set. I have converted the data set to a pandas data frame and then applying the IQR principle. After that I want to replace my OUTLIERS by zero and then calculate the mean and standard deviation as the outliers as skewing the mean and SD.


The code for Data Set is as below:

import pandas as pd
data = [[123,100,1200,800,800,1200,900,1400],[246,15,16,45,15,45,11,55],[234,90,105,180,90,180,100,220],[236,100,90,9000,90,9000,70,140]]
df = pd.DataFrame(data,columns=['ID','Store1','Store2','Store3','Min','Max','Lower_Limit','Upper_limit'])
print (df)

Data Set Snippet:

    ID  Store1  Store2  Store3  Min   Max  Lower_Limit Upper_limit
  123     100    1200     800  800  1200          900        1400
  246      15      16      45   15    45           11        55
  234      90     105     180   90   180          100          220
  236     100      90    9000   90  9000           70          140

I want to update values of Store1,Store2,Store3 to ZERO(0) if they are less than Lower_limit(['Store1'] < ['Lower_limit']) or greater than Upper_limit(['Store1'] > ['Upper_limit']).


Below is my Function:

def calculate_Outliers(row):
    if row['Store1'] < row['Lower_limit'] or row['Store1'] > row['Upper_limit']:
        return 0
    else:
        return row['Store1']
    if row['Store2'] < row['Lower_limit'] or row['Store2'] > row['Upper_limit']:
        return 0
    else:
        return row['Store2']
    if row['Store3'] < row['Lower_limit'] or row['Store3'] > row['Upper_limit']:
        return 0
    else:
        return row['Store3']

I am applying it like this:

df['Store1','Store3','Store3'] = df.apply(calculate_Outliers, axis=1)

Below is the result which is wrong...

    ID  Store1 Store2 Store3(Store1 Store2 Store3)
ID                  
123 NaN NaN NaN NaN 1000
246 NaN NaN NaN NaN 15
234 NaN NaN NaN NaN 0
236 NaN NaN NaN NaN 0

Desired Result:
ID  Store1  Store2  Store3  Min Max Lower_Limit Upper_limit
123 100    1200     800     800 1200    900      1400
246 15     16       45      15  45      11       55
234 0      105      180     90  180    100       220
236 100    90       0       90  9000    70       140

Is there a way in which i can modify my original code to achieve this?

U13-Forward
  • 69,221
  • 14
  • 89
  • 114
New_Coder
  • 23
  • 5

3 Answers3

1

Try this:

m=df.filter(like='Store').lt(df.Lower_Limit,axis=0)|df.filter(like='Store').\
                                                     gt(df.Upper_limit,axis=0)

df.update(df.where(~m,0).filter(like='Store'))
print(df)

    ID  Store1  Store2  Store3  Min   Max  Lower_Limit  Upper_limit
0  123       0    1200       0  800  1200          900         1400
1  246      15      16      45   15    45           11           55
2  234       0     105     180   90   180          100          220
3  236     100      90       0   90  9000           70          140

EDIT you can use iloc[] if column names doesnot have a common string:

m=df.iloc[:,1:4].lt(df.Lower_Limit,axis=0)|df.iloc[:,1:4].gt(df.Upper_limit,axis=0)
df.update(df.where(~m,0).iloc[:,1:4])
print(df)

    ID  Store1  Store2  Store3  Min   Max  Lower_Limit  Upper_limit
0  123       0    1200       0  800  1200          900         1400
1  246      15      16      45   15    45           11           55
2  234       0     105     180   90   180          100          220
3  236     100      90       0   90  9000           70          140

Wrapping in a function :

def calculate_Outliers(df):
    m1= df['Store1'].lt(df['Lower_limit'])|df['Store1'].gt(df['Upper_limit'])
    m2 = df['Store2'].lt(df['Lower_limit'])|df['Store2'].gt(df['Upper_limit'])
    m3= df['Store3'].lt(df['Lower_limit'])|df['Store3'].gt(df['Upper_limit'])
    df.loc[m1,'Store1']=0
    df.loc[m1,'Store2']=0
    df.loc[m1,'Store3']=0
    print(df)
calculate_Outliers(df)

    ID  Store1  Store2  Store3  Min   Max  Lower_limit  Upper_limit
0  123       0       0       0  800  1200          900         1400
1  246      15      16      45   15    45           11           55
2  234       0       0       0   90   180          100          220
3  236     100      90    9000   90  9000           70          140
anky
  • 74,114
  • 11
  • 41
  • 70
  • Hi @anky_91, This is an sample of data set . Original data set has 39 variables and the name of the variables is not common like the one mentioned in the example data set. They are extreme names which minimal string match. Can you help me with coming up with a more generic code. (PS: there are 3 million rows) – New_Coder Mar 10 '19 at 05:56
  • @New_Coder if column names doesnot have a common string so you cannot use `filter` , you can use `df.iloc[]` if you know the positions of the columns to update. Check EDIT. – anky Mar 10 '19 at 06:14
  • Hi @anky_91, It works thank you very much!!! It was really helpful!! I am new to OOPS and hence was trying it in a way where i can make code productionized as it will go in production. Can you let me know and help me to write a function using IF /ELSE the original one which i tried? – New_Coder Mar 10 '19 at 06:41
  • @New_Coder check EDIT – anky Mar 10 '19 at 07:15
  • Thank you @anky_91 in my function it was not accessing the columns and rows simultaneously? – New_Coder Mar 11 '19 at 00:58
0

The below function should do the work:

def calculate_outliers(df):
    df['Store1'][(df['Store1']<df['Lower_Limit']) | (df['Store1'] > df['Upper_limit'])] = 0
    df['Store2'][(df['Store2']<df['Lower_Limit']) | (df['Store2'] > df['Upper_limit'])] = 0
    df['Store3'][(df['Store3']<df['Lower_Limit']) | (df['Store3'] > df['Upper_limit'])] = 0
0
df.loc[(df['Store1']<df['Lower_Limit']) | (df['Store1']>df['Upper_limit']),['Store1'] ] = 0

and repeat for the other stores.

Baris Tasdelen
  • 316
  • 1
  • 5
  • Hi Baris, Getting same error : TypeError: 'str' object cannot be interpreted as an integer. KeyError: ('Lower_Limit', 'occurred at index 123') my data type is int only RangeIndex: 13 entries, 0 to 12 Data columns (total 9 columns): ID 13 non-null int64 store1 13 non-null int64 store2 13 non-null int64 store3 13 non-null int64 Min 13 non-null int64 Max 13 non-null int64 Lower_limit 13 non-null int64 Upper_limit 13 non-null int64 – New_Coder Mar 10 '19 at 05:46