I have a dataset of 10 participants. These participants conducted different walking tests and physiological responses (i.e., heart rate, breathing frequency) were collected.

I want to detect outliers with a 3-standard-deviation rule in Stata. The usual approach is a moving window of 30 seconds: for a data point at time t, the mean and standard deviation are calculated over the values from t-15 to t+15. If the value at time t is higher than mean + 3 SD or lower than mean - 3 SD, it is considered an outlier.

What is the command for this? Also, how should I handle the outliers once they are identified?

I tried this command: egen stdvar = std(var)

Nick Cox
Sophie

2 Answers

The summarize command returns the mean and standard deviation, which you can then use to define outliers. It sounds like you want to recalculate the mean and standard deviation for each point in time; that can be done with a loop. Here is an example with one of Stata's built-in datasets so you can reproduce it. Note that it works on consecutive blocks of 10 observations rather than a window centred on each observation:

sysuse sp500.dta, clear
gen t = _n
gen outlier = 0
* Getting the mean and standard deviation for each block of 10 observations.
* The upper limit is rounded up so that a final partial block is not skipped:
local last = 10 * ceil(_N/10)
forv i = 10(10)`last' {
    * Replace volume with the variable you are using
    quietly summarize volume if `i' >= t & `i' < t + 10
    local avg = r(mean)
    local stddev = r(sd)
    * Replace 2*`stddev' with 3*`stddev' for a 3 SD rule. In the example dataset, there were no outliers that were more than 3 standard deviations from the mean
    replace outlier = 1 if volume > `avg' + 2*`stddev' & `i' >= t & `i' < t + 10
    replace outlier = 1 if volume < `avg' - 2*`stddev' & `i' >= t & `i' < t + 10
}
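
The block approach above recomputes the statistics once per block of 10 observations. A window centred on each observation, closer to what the question describes, can be sketched with the same summarize-in-a-loop idea. This is only an illustration: the half-width of 5 observations, the 2 SD threshold and the variable names `mean_w`, `sd_w` and `outlier_w` are assumptions, not part of the original answer.

```stata
sysuse sp500.dta, clear
gen t = _n

* Holders for the windowed mean and SD (hypothetical names)
gen double mean_w = .
gen double sd_w = .

* For each observation, summarize the values from t-5 to t+5 (assumed half-width)
forvalues i = 1/`=_N' {
    quietly summarize volume if inrange(t, `i' - 5, `i' + 5)
    local m = r(mean)
    local s = r(sd)
    quietly replace mean_w = `m' in `i'
    quietly replace sd_w = `s' in `i'
}

* Flag values more than 2 SD from the windowed mean (use 3 for a 3 SD rule)
gen byte outlier_w = abs(volume - mean_w) > 2 * sd_w if !missing(volume, sd_w)
count if outlier_w == 1
```

Looping over every observation is slow on large datasets, so this is mainly useful when no community-contributed command can be installed.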
Nick Cox
With a grateful nod to @matthewSwilson's helpful answer, this gets a little closer to what you're asking. I used 2 SD not 3 SD because otherwise no outliers are identified with this sandbox dataset. (Clearly, we can't use your data.)

sysuse sp500.dta, clear 

gen t = _n

* ssc install rangestat 
rangestat (mean) mean=volume (sd) sd=volume, int(t -5 5)

gen outlier = (volume > mean + 2*sd) | (volume < mean - 2*sd) 

line volume t || scatter volume t if outlier, legend(order(2 "outliers?") ring(0) pos(1)) 
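
To get closer to the setup in the question (several participants, a window from t-15 to t+15 seconds, a 3 SD threshold), the same rangestat call can be run by participant. The sketch below uses simulated data because the real data are not available. The variable names `id`, `time` and `hr`, the one-reading-per-second spacing and the planted spike are all assumptions made for illustration.

```stata
* ssc install rangestat
clear
set seed 12345

* Simulated data: 10 participants, 300 one-second readings each (assumed layout)
set obs 10
gen id = _n
expand 300
bysort id: gen time = _n
gen hr = 70 + 5 * rnormal()
replace hr = 150 if id == 1 & time == 100   // planted spike for illustration

* Mean, SD and number of readings in the window [time-15, time+15], per participant
rangestat (mean) hr_mean=hr (sd) hr_sd=hr (count) hr_n=hr, interval(time -15 15) by(id)

* 3 SD rule
gen byte outlier = abs(hr - hr_mean) > 3 * hr_sd if !missing(hr, hr_sd)

* One possible treatment (an assumption, not a recommendation): set flagged values to missing
gen hr_clean = hr if outlier != 1
```

Because the interval is defined on the values of `time` rather than on observation numbers, gaps in the recording simply give windows with fewer readings; `hr_n` shows how many values each window used.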

Whether this rule is good (if you need a rule at all, I would suggest something based on median and IQR) and what you should do about outliers are wider questions, better suited for Cross Validated.

[Figure: line plot of volume against t, with the observations flagged as outliers overlaid as scatter points]
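
The median and IQR suggestion above could be sketched with official commands only, since `summarize, detail` returns the quartiles in `r(p25)`, `r(p50)` and `r(p75)`. The exact rule is an assumption here: a value is flagged when it lies more than 1.5 IQR from the windowed median, and both the multiplier 1.5 and the half-width of 5 observations are arbitrary choices for illustration. The name `outlier_iqr` is hypothetical.

```stata
sysuse sp500.dta, clear
gen t = _n
gen byte outlier_iqr = 0

forvalues i = 1/`=_N' {
    * Quartiles of the values from t-5 to t+5 (assumed half-width)
    quietly summarize volume if inrange(t, `i' - 5, `i' + 5), detail
    local med = r(p50)
    local iqr = r(p75) - r(p25)
    * Flag the current observation if it is more than 1.5 IQR from the median
    quietly replace outlier_iqr = 1 if abs(volume - `med') > 1.5 * `iqr' & !missing(volume) in `i'
}

count if outlier_iqr == 1
```

The median and IQR are much less affected by the extreme values they are meant to detect than the mean and SD are, which is the usual argument for a rule of this kind.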

Nick Cox