0

I have to present some data to work colleagues and i am having issues analysing it in MySQL.

I have 1 table called 'payments'. Each payment has columns for:

  1. Client (our client e.g. a bank)
  2. Amount_gbp (the GBP equivalent of the value of the transaction)
  3. Currency
  4. Origin_country
  5. Client_type (individual or company)

I have written pretty simple queries like:

SELECT  
    AVG(amount_GBP), 
    COUNT(client) AS '#Of Results'
FROM payments

WHERE client_type = 'individual'
    AND amount_gbp IS NOT NULL
    AND currency = 'TRY'
    AND country_origin = 'GB'
    AND date_time BETWEEN '2017/1/1' AND '2017/9/1'

But what i really need to do is eliminate outliers from the average AND/OR only include results within a number of Standard Deviations from the Mean.

For example, ignore the top/bottom 10 results of 2% of results etc. AND/OR ignore any results that fall outside of 2 STDEVs from the Mean

Can anyone help?

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
Charlie Felix
  • 91
  • 1
  • 1
  • 8

1 Answers1

0

--- EDITED ANSWER -- TRY AND LET ME KNOW ---

Your best best is to create a TEMPORARY table with the avg and std_dev values and compare against them. Let me know if that is not feasible:

CREATE TEMPORARY TABLE payment_stats AS
 SELECT
  AVG(p.amount_gbp) as avg_gbp,
  STDDEV(amount_gbp) as std_gbp,
  (SELECT MIN(srt.amount_gbp) as max_gbp
    FROM (SELECT amount_gbp
     FROM payments
     <... repeat where no p. ...>
     ORDER BY amount_gbp DESC
     LIMIT <top_numbers to ignore>
   ) srt
  ) max_g,
  (SELECT MAX(srt.amount_gbp) as min_gbp
    FROM (SELECT amount_gbp
     FROM payments
     <... repeat where no p. ...>
     ORDER BY amount_gbp ASC
     LIMIT <top_numbers to ignore>
   ) srt
  ) min_g
 FROM payments
 WHERE client_type = 'individual'
  AND amount_gbp IS NOT NULL
  AND currency = 'TRY'
  AND country_origin = 'GB'
  AND date_time BETWEEN '2017/1/1' AND '2017/9/1';

You can then compare against the temp table

SELECT  
 AVG(p.amount_gbp) as avg_gbp, 
 COUNT(p.client) AS '#Of Results'
FROM payments p
WHERE
 p.amount_gbp >= (SELECT (avg_gbp - std_gbp*2) 
                FROM payment_stats)
 AND p.amount_gbp <= (SELECT (avg_gbp + std_gbp*2) 
                FROM payment_stats)
 AND p.amount_gbp > (SELECT min_g FROM payment_stats)
 AND p.amount_gbp < (SELECT max_g FROM payment_stats)
 AND p.client_type = 'individual'
 AND p.amount_gbp IS NOT NULL
 AND p.currency = 'TRY'
 AND p.country_origin = 'GB'
 AND p.date_time BETWEEN '2017/1/1' AND '2017/9/1';

-- Later on

DROP TEMPORARY TABLE payment_stats;

Notice I had to repeat the WHERE condition. Also change *2 to whatever <factor> to what you need!

Still Phew!

Each compare will check a different stat

Let me know if this is better

Jacques Amar
  • 1,803
  • 1
  • 10
  • 12
  • Hey, thank you for this!!! I have 2 questions: 1. Where do you put p. before the column selection in the WHERE query? 2. MySQL is telling me that the second WHERE query isnt correct - the syntax is unexpected. How do i solve this? – Charlie Felix Aug 22 '17 at 09:20
  • I think it is expected ON as the next command because we used LEFT JOIN... how do i do this? – Charlie Felix Aug 22 '17 at 09:35
  • If I understand your question, each subselect (within parenthesis) is areally an idependant query and needs no qualifiers. The subselect becomes a new table and the name that follows identifies this table. I see that I have an error in the subsequent subselect and I'll fix that – Jacques Amar Aug 24 '17 at 22:36
  • It really doesnt work without the ON command... what do you suggest? – Charlie Felix Sep 05 '17 at 15:39
  • I redid the answer with a temporary table that simplifies things. If you can't use a temp table, you'd have to repeat the query each time .. Try and let me know – Jacques Amar Sep 06 '17 at 02:15