0

I currently have a dataset that contains transactional information, including dates, an amount field, and descriptive fields. I am looking to do simple statistical analysis (mean, median, etc.), but need to remove outliers first. The issue I'm having is removing the outliers while grouping the data by Region, City, and Date. I'm sure this could be done more easily using a robust programming language (R, Python, etc.), but I have to use SQL (SSMS) for this task.

Simple dataset example:

| Region | City    | Date       | Amount |
| ------ | ------- | ---------- | ------ |
| SW     | Phoenix | 2021-10-01 | 400    |
| NE     | Boston  | 2021-10-03 | 20     |
| SW     | Phoenix | 2021-10-03 | 800    |
| SW     | Phoenix | 2021-10-02 | 425    |
| NE     | Boston  | 2021-10-01 | 500    |
| SW     | Phoenix | 2021-10-02 | 15     |
| SW     | Phoenix | 2021-10-04 | 100    |
| NE     | Boston  | 2021-10-04 | 35     |
| SE     | Orlando | 2021-10-02 | 300    |

Initially I wanted to use IQR method, but I'm thinking using standard deviation (STDEV) will be easier. However, I'm running into issues with the grouping once I include the STDEV code.

Here is the current code as I have it:

WITH CTE_data AS (
SELECT 
     Region
    ,City
    ,Date
    ,Amount
FROM OrderTable
)

SELECT 
     Region
    ,City
    ,MAX(Date) AS MaxDate
    ,MIN(Date) AS MinDate
    ,AVG(Amount) AS AvgAmt
    ,STDEV(Amount) AS StedvAmt
FROM CTE_data
GROUP BY Region, City 

Given the sample dataset, I'd like to ignore the amounts that fall outside of ((Amount > (AvgAmt - StedvAmt) AND Amount < (AvgAmt + StedvAmt)). My end goal is to have an average amount for each Region/City, as well as an average amount (excluding the outliers). My desired output would be:

| Region | City    | MinDate    | MaxDate    | AvgAmt | AvgAmt_rem |
| ------ | ------- | ---------- | ---------- | ------ | ---------- |
| SW     | Phoenix | 2021-10-01 | 2021-10-04 | 348    | 308.33     |
| NE     | Boston  | 2021-10-01 | 2021-10-04 | 185    | 27.5       |
| SE     | Orlando | 2021-10-02 | 2021-10-02 | 300    | 300        |

Please note: for simplicity I didn't do the actual standard deviation calculation for my desired output, I just removed the outliers based on looking at the values (800 & 15 for Phoenix and 500 for Boston).

2 Answers2

2

You can do that by doing your calculation twice, the first time only to use as a filter, the second time to exclude outliers and recalculate the new average

with CTE_data as ( --Your original data
    SELECT * FROM ( VALUES 
        ('SW', 'Phoenix', CONVERT(DATE,'2021-10-01'), CONVERT(DECIMAL(18,6),'400'))
        , ('NE', 'Boston', '2021-10-03', '20')
        , ('SW', 'Phoenix', '2021-10-03', '800')
        , ('SW', 'Phoenix', '2021-10-02', '425')
        , ('NE', 'Boston', '2021-10-01', '500')
        , ('SW', 'Phoenix', '2021-10-02', '15')
        , ('SW', 'Phoenix', '2021-10-04', '100')
        , ('NE', 'Boston', '2021-10-04', '35')
        , ('SE', 'Orlando', '2021-10-02', '300')
    ) as TabA(Region, City, ReadingDate, Amount) 
), cteStats as ( --Calculate the Avg & Std of the raw data
    SELECT 
        Region
        , City
        , AVG(Amount) AS AvgRaw --ADD BELOW CASE to keep singleton from filtering itself
        , CASE WHEN STDEV(Amount) > 0.01 THEN STDEV(Amount) ELSE 0.01 END AS StedRaw
    FROM CTE_data
    GROUP BY Region, City 
), ctFiltered as ( --do it again but this time filter out the outliers
    SELECT 
        D.Region
        ,D.City
        ,MIN(ReadingDate) AS MinDate
        ,MAX(ReadingDate) AS MaxDate
        ,AVG(Amount) AS AvgFiltered
        ,STDEV(Amount) AS StedvFiltered
    FROM CTE_data as D --Your original data again, but tied to the stats and filtered on the stats
        INNER JOIN cteStats as S on D.Region = S.Region AND D.City = S.City 
    WHERE D.Amount BETWEEN S.AvgRaw - S.StedRaw AND S.AvgRaw + S.StedRaw 
    GROUP BY D.Region, D.City 
)
SELECT D.Region, D.City , D.MinDate , D.MaxDate
    , CONVERT(decimal(18,2),S.AvgRaw) as AvgAmt
    , CONVERT(decimal(18,2),D.AvgFiltered) as AvgAmt_rem
    --, S.StedRaw as SDevPreFilter, D.StedvFiltered
FROM ctFiltered as D --Tie the filtered back to the stats so you can see the filter criteria
    INNER JOIN cteStats as S on D.Region = S.Region AND D.City = S.City 

EDIT: Add a CASE statement to test if the standard deviation is zero (a single value for that city) and replace it with a small StdDev, which won't matter because there can't be outliers to begin with if there is a single value!

Robert Sheahan
  • 2,100
  • 1
  • 10
  • 12
  • This is great! I'm running into a new issue, however. When I add a new single record (e.g. 'SE', 'Orlando', '2021-10-02', '350'), it isn't showing up in the last SELECT statement. I've tried changing the joins to LEFT OUTERs, but still only seeing Phoenix and Boston. – Josh Berner Aug 31 '22 at 19:32
  • I updated my initial question to include a single record in the sample dataset, and how I need it to appear in the desired output. Thanks! – Josh Berner Aug 31 '22 at 19:50
  • @JoshBerner, the standard deviation of a single datum is zero so it's filtering the one out, I'll look into a way to allow a wider window and post back either later today or tomorrow morning – Robert Sheahan Aug 31 '22 at 20:57
  • @JoshBerner, the code now has a test to see if the standard deviation is zero and if so, replace it with a small deviation. Since stddev=0 can only happen if there are no outliers to begin with, this will not affect the accuracy of the results. Let me know if this is not clear and I'll explain better – Robert Sheahan Aug 31 '22 at 21:45
2

It's not necessary to use a self-join. You can do this with a single scane of the base table, using window functions

WITH cteStats as ( --Calculate the Avg & Std of the raw data
    SELECT 
         *
        ,AVG(Amount) OVER (PARTITION BY Region, City) AS AvgRaw
        ,STDEV(Amount) OVER (PARTITION BY Region, City) AS StedRaw
    FROM CTE_data
)
SELECT 
         S.Region
        ,S.City
        ,MIN(S.ReadingDate) AS MinDate
        ,MAX(S.ReadingDate) AS MaxDate
        ,AVG(S.Amount) AS AvgFiltered
        ,STDEV(S.Amount) AS StedvFiltered
FROM cteStats as S
WHERE s.Amount BETWEEN S.AvgRaw - S.StedRaw AND S.AvgRaw + S.StedRaw 
GROUP BY s.Region, S.City;

db<>fiddle

You can also show both with and without the filtered rows, by using conditional aggregation

WITH cteStats as ( --Calculate the Avg & Std of the raw data
    SELECT 
         *
        ,AVG(Amount) OVER (PARTITION BY Region, City) AS AvgRaw
        ,STDEV(Amount) OVER (PARTITION BY Region, City) AS StedRaw
    FROM CTE_data
)
SELECT 
         S.Region
        ,S.City
        ,MIN(CASE WHEN s.Amount BETWEEN S.AvgRaw - S.StedRaw AND S.AvgRaw + S.StedRaw 
               THEN S.ReadingDate END) AS MinDate
        ,MAX(CASE WHEN s.Amount BETWEEN S.AvgRaw - S.StedRaw AND S.AvgRaw + S.StedRaw 
               THEN S.ReadingDate END) AS MaxDate
        ,AVG(CASE WHEN s.Amount BETWEEN S.AvgRaw - S.StedRaw AND S.AvgRaw + S.StedRaw 
               THEN S.Amount END) AS AvgFiltered
        ,AVG(S.Amount) AS AvgAll
FROM cteStats as S
GROUP BY s.Region, S.City;

db<>fiddle

Charlieface
  • 52,284
  • 6
  • 19
  • 43
  • What you've provided in the second option is very close to what I'm looking for. One thing I'd like, if it's possible: getting the MinDate/MaxDate for a record that's filtered out. For example: The record ('NE', 'Boston', '2021-10-01', '500') get's filtered out, but I'd like the '2021-10-01' date to show in the MinDate for the output. Is this possible? – Josh Berner Aug 31 '22 at 20:53
  • Then just `MIN(S.ReadingDate) AS MinAllDates` etc. Essentially that second query can use normal aggregation to aggregate over all rows, and use conditional aggregation for just the rows you want to filter – Charlieface Aug 31 '22 at 20:54