I have a data set like the one below. For each combination of client, task and subtask, I want to exclude the extreme values, meaning the top 10%. I want two data sets as output: one with the extreme values for all combinations, and one with the normal values for all combinations.

client  task  subtask  time
a       abc   t1       12
a       abc   t2       23
b       xyz   t3       334
c       ijk   t1       1
c       ijk   t1       12
b       xyz   t1       12
a       xyz   t2       23
b       ijk   t3       24
a       ijk   t2       344
c       xyz   t3       34343
b       ijk   t2       34
c       xyz   t3       34
a       xyz   t1       23
c       ijk   t1       223
a       ijk   t1       23
b       xyz   t3       21
b       ijk   t1       45
a       xyz   t2       23
c       ijk   t3       45
shadow
Rakesh

2 Answers

You can use quantile() to flag, within each group, the values above the 90th percentile (the highest 10%):

DF <- within(DF,
             extreme <- as.logical(         # ave() returns 0/1 for numeric input
               ave(time,                    # your values
                   client, task, subtask,   # grouping factors
                   FUN = function(x) x > quantile(x, 0.9))))

Then use subsetting to extract the values you want.
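A minimal sketch of that subsetting step, assuming the data frame is called DF and is rebuilt here from the sample in the question. The names extremes and normal are only illustrative. Because ave() writes its result back into the numeric input, the flag comes out as 0/1 rather than TRUE/FALSE, so the sketch converts it with as.logical() before indexing.

```r
# Rebuild the sample data from the question so the sketch runs on its own.
DF <- read.table(header = TRUE, text = "
client task subtask time
a abc t1 12
a abc t2 23
b xyz t3 334
c ijk t1 1
c ijk t1 12
b xyz t1 12
a xyz t2 23
b ijk t3 24
a ijk t2 344
c xyz t3 34343
b ijk t2 34
c xyz t3 34
a xyz t1 23
c ijk t1 223
a ijk t1 23
b xyz t3 21
b ijk t1 45
a xyz t2 23
c ijk t3 45
")

# Flag values strictly above the 90th percentile of their group.
DF$extreme <- as.logical(
  ave(DF$time, DF$client, DF$task, DF$subtask,
      FUN = function(x) x > quantile(x, 0.9)))

# Two data sets: flagged rows and everything else (names are illustrative).
extremes <- DF[DF$extreme, ]
normal   <- DF[!DF$extreme, ]
```

Note that the comparison is strict, so a group with a single row, or a group whose values are all equal, never contributes a row to extremes.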

Roland

You can also use the dplyr package, which expresses the same grouped calculation more compactly and tends to be faster on large data sets.

library(dplyr)

# Flag values above the 90th percentile within each group
DF %>%
  group_by(client, task, subtask) %>%
  mutate(extreme = time > quantile(time, .9))
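
To get the two data sets the question asks for, the flagged result can be filtered on the new column. The following is a sketch under the assumption that dplyr is installed and that DF holds the sample data from the question. The names flagged, extremes and normal are illustrative, not part of the original answer.

```r
library(dplyr)

# Rebuild the sample data from the question so the sketch runs on its own.
DF <- read.table(header = TRUE, text = "
client task subtask time
a abc t1 12
a abc t2 23
b xyz t3 334
c ijk t1 1
c ijk t1 12
b xyz t1 12
a xyz t2 23
b ijk t3 24
a ijk t2 344
c xyz t3 34343
b ijk t2 34
c xyz t3 34
a xyz t1 23
c ijk t1 223
a ijk t1 23
b xyz t3 21
b ijk t1 45
a xyz t2 23
c ijk t3 45
")

# Add the logical flag per group, then drop the grouping again.
flagged <- DF %>%
  group_by(client, task, subtask) %>%
  mutate(extreme = time > quantile(time, 0.9)) %>%
  ungroup()

# Split into the two requested data sets (names are illustrative).
extremes <- filter(flagged, extreme)
normal   <- filter(flagged, !extreme)
```

Unlike the ave() approach, mutate() keeps the flag as a logical column, so it can be passed to filter() directly.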
shadow