0

I have a data frame that I want to remove any week that contains an outlier. I will be happy if I can indicate the entire week as an outlier, as I understand how to do the subset from there. I have not been able to come up with an appropriate solution. I keep thinking that I am going to need to loop through subsets of weeks to achieve the desired goal, or create a separate function to handle the individual outlier week and use sapply. I have yet to make either of these solutions viable.

date <- seq(as.Date("2015-01-01"), length=365, by="1 day")
dow <- as.factor(weekdays(as.Date(date))
df <- data.frame(cbind(date, dow))
df$date <- as.Date(df$date,format="%m/%d/%Y",origin="01/01/1970")
df$dow <- as.factor(weekdays(as.Date(df$date)))
set.seed(1115)
df$var1 <- rnorm(365, 1912, 40795)
stdev <- sd(df$var1, na.rm=TRUE)
avg <- mean(df$var1, na.rm=TRUE)
df$LB <- avg-(2.75*stdev)
df$UB <- avg+(2.75*stdev)
df$outlier <- ifelse(df$var1<df$LB | df$var1>df$UB, 1,0)
df$weeknum <- as.numeric(format(df$date, "%U"))
head(df, 17)

> head(df, 17)
         date       dow       var1        LB       UB outlier weeknum
1  2015-01-01  Thursday  -7828.412 -114675.6 120479.8       0       0
2  2015-01-02    Friday  25674.456 -114675.6 120479.8       0       0
3  2015-01-03  Saturday -33588.871 -114675.6 120479.8       0       0
4  2015-01-04    Sunday -54418.175 -114675.6 120479.8       0       1
5  2015-01-05    Monday -10002.002 -114675.6 120479.8       0       1
6  2015-01-06   Tuesday  34050.390 -114675.6 120479.8       0       1
7  2015-01-07 Wednesday -37584.648 -114675.6 120479.8       0       1
8  2015-01-08  Thursday  84048.878 -114675.6 120479.8       0       1
9  2015-01-09    Friday -24801.346 -114675.6 120479.8       0       1
10 2015-01-10  Saturday  33974.637 -114675.6 120479.8       0       1
11 2015-01-11    Sunday  77432.088 -114675.6 120479.8       0       2
12 2015-01-12    Monday 128196.236 -114675.6 120479.8       1       2
13 2015-01-13   Tuesday   9740.418 -114675.6 120479.8       0       2
14 2015-01-14 Wednesday  26539.887 -114675.6 120479.8       0       2
15 2015-01-15  Thursday  12172.834 -114675.6 120479.8       0       2
16 2015-01-16    Friday   1032.544 -114675.6 120479.8       0       2
17 2015-01-17  Saturday  76870.095 -114675.6 120479.8       0       2

In the above example, the desired output would be a 1 the outlier column in each row that corresponds with weeknum = 2.

Richard Lusch
  • 1,050
  • 10
  • 19
  • Something like this `df[ df$weeknum == 2 & df$outlier == 1, ]` ? – Roman Feb 08 '16 at 16:04
  • The only reason that weeknum = 2 should be subset is because the outlier happens to appear that week in row 12. The code I am looking to create will find the outlier point in any week and code the entire week as an outlier. The data set contains 365 rows, so the example above is just the first 17 rows, which happened to have an outlier in it. – Richard Lusch Feb 08 '16 at 18:22

2 Answers2

0

You say "the desired output would be a 1 the outlier column in each row that corresponds with weeknum = 2." Do you actually need an outlier column, then? Seems like you can simply subset your data.frame based on the values of the weeknum column, as follows:

df <- df[!(df$weeknum==2),]
mshum
  • 257
  • 3
  • 15
0

The answer involves testing two vectors. Once I realized this I was able to refine my search and find a suitable answer here.

The code necessary to properly identify each element was:

out.df <- df[which(df$outlier==1),]#Create a subset of only outlier rows
df$outlier <- ifelse(df$weeknum %in% out.df$weeknum, 1, 0)#Compare the new data frame
#weeknum against the old with the %in% operator, if they are equal leave 1, else 0.

This gave the result:

> head(df, 17)
         date       dow       var1        LB       UB outlier weeknum
1  2015-01-01  Thursday  -7828.412 -114675.6 120479.8       0       0
2  2015-01-02    Friday  25674.456 -114675.6 120479.8       0       0
3  2015-01-03  Saturday -33588.871 -114675.6 120479.8       0       0
4  2015-01-04    Sunday -54418.175 -114675.6 120479.8       0       1
5  2015-01-05    Monday -10002.002 -114675.6 120479.8       0       1
6  2015-01-06   Tuesday  34050.390 -114675.6 120479.8       0       1
7  2015-01-07 Wednesday -37584.648 -114675.6 120479.8       0       1
8  2015-01-08  Thursday  84048.878 -114675.6 120479.8       0       1
9  2015-01-09    Friday -24801.346 -114675.6 120479.8       0       1
10 2015-01-10  Saturday  33974.637 -114675.6 120479.8       0       1
11 2015-01-11    Sunday  77432.088 -114675.6 120479.8       1       2
12 2015-01-12    Monday 128196.236 -114675.6 120479.8       1       2
13 2015-01-13   Tuesday   9740.418 -114675.6 120479.8       1       2
14 2015-01-14 Wednesday  26539.887 -114675.6 120479.8       1       2
15 2015-01-15  Thursday  12172.834 -114675.6 120479.8       1       2
16 2015-01-16    Friday   1032.544 -114675.6 120479.8       1       2
17 2015-01-17  Saturday  76870.095 -114675.6 120479.8       1       2

Which is satisfactory.

Community
  • 1
  • 1
Richard Lusch
  • 1,050
  • 10
  • 19