2

I have this dataset:

A <- paste0("event_", c(1:100))
some_number <- sample.int(1000,size=100) 
X1 <- c(1:100)
X2 <- c(101:200)
X3 <- c(201:300)
X4 <- c(301:400)
X5 <- c(401:500)
DF <- data.frame(A, some_number, X1, X2, X3, X4, X5)

As I'm treating outliers, I'm looking to delete the rows that contains the 1th and the latest percentile, considering only the X variables for the percentile computation and all X variables as ONE group. Hence, the percentiles will consider X1 to X5 as ONE group. For this it occurs to me these steps:

  1. Replace the values of X1 to X5 with 1 to 100 (1 for each percentile). Remember, I'm not looking for the percentiles of each X, but for all X's as a whole.
  2. Delete the rows where the variables X1 to X5 contains 1 or 100

My attempt: (based on how to find percentiles, replace outliers with the 5th and 95th percentile, remove data greater than 95th percentile in data frame)

as.data.frame(sapply(select(DF, X1:X5), function (x) {
     qx <- quantile(x, probs = c(1:100)/100)
     cut(x, qx, labels = c(1:100))
}))

But.. my attempt raises the error that the number of breaks is different to the number of labels, I'm struggling to assign the new dataframe without losing A and some_number variables (in my real problem they are not two columns, but nearly 50)

Any suggestions?

Braiam
  • 1
  • 11
  • 47
  • 78
Chris
  • 2,019
  • 5
  • 22
  • 67
  • Using `sapply` you are calculating quantile for each column separately. If you want to consider all the X variables as whole do you want to remove rows where value is outside this range `quantile(as.matrix(DF[3:7]), c(0.01, 0.99))` ? – Ronak Shah Jun 09 '21 at 03:08
  • 1
    You have 99 levels `nlevels(cut(DF$X1, quantile(DF$X1, probs = c(1:100)/100)))` and are assigning 100 labels. – thelatemail Jun 09 '21 at 03:13
  • @RonakShah yes, that's right. thelatemail do you know how can I create that additional level in order to match the 100 labels? (Because it's 100 percentiles). – Chris Jun 09 '21 at 03:35

2 Answers2

3

Using both across and c_across in dplyr, you may also do this-

Steps explained -

  • c_across is usually used with row_wise as it creates a complete copy of data subsetted through its inner argument. But I have done it without rowwise() so instead of creating one row it is creating a copy of whole data as desired.
  • thereafter two quantiles of this data will be deduced. (which will be scalar quantities)
  • Now only job remains is to to check these values with every other value in data. So I used here across directly.
  • Using across I built a lambda formula which starts with a twiddle and its argument is . only. This twiddle style formula ~ . is equivalent to function(x) x and the rest is clear.
DF %>% mutate(across(starts_with('X'), ~ifelse(. > quantile(c_across(starts_with('X')), 0.99) |
                                                 . < quantile(c_across(starts_with('X')), 0.01),
                                               NA, .) 
                     )) %>% na.omit()

#>           A some_number X1  X2  X3  X4  X5
#> 6   event_6          69  6 106 206 306 406
#> 7   event_7         871  7 107 207 307 407
#> 8   event_8         356  8 108 208 308 408
.
.
.
#> 93 event_93         432 93 193 293 393 493
#> 94 event_94         967 94 194 294 394 494
#> 95 event_95         516 95 195 295 395 495

Since starts_with works only in across or c_across and to avoid slower rowwise here, we can also do this directly

DF %>% filter(rowSums(cur_data()[str_detect(names(DF), 'X')] > quantile(c_across(starts_with('X')), 0.99)) == 0 &
                rowSums(cur_data()[str_detect(names(DF), 'X')] < quantile(c_across(starts_with('X')), 0.01)) == 0)

This will also give 90 rows in output as desired

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45
  • 1
    Could you explain a little bit what does the first bunch of code? Specifically how mutate actually creates the column without using '=', and the use of ~. as my intuition tell me it's not used as negation. Thanks! – Chris Jun 10 '21 at 02:13
  • 1
    @Chris, my pleasure. See the edit please. – AnilGoyal Jun 10 '21 at 02:30
1

You can try the following -

library(dplyr)
vec <- DF %>% select(starts_with('X')) %>% as.matrix() %>% quantile(c(0.01, 0.99))

DF %>% filter(if_all(starts_with('X'), ~. > vec[1] & . < vec[2]))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213