Replace and delete first and last percentile in dataframe or multiple columns at once

Question

I have this dataset:

A <- paste0("event_", c(1:100))
some_number <- sample.int(1000,size=100) 
X1 <- c(1:100)
X2 <- c(101:200)
X3 <- c(201:300)
X4 <- c(301:400)
X5 <- c(401:500)
DF <- data.frame(A, some_number, X1, X2, X3, X4, X5)

As I'm treating outliers, I'm looking to delete the rows that contains the 1th and the latest percentile, considering only the X variables for the percentile computation and all X variables as ONE group. Hence, the percentiles will consider X1 to X5 as ONE group. For this it occurs to me these steps:

Replace the values of X1 to X5 with 1 to 100 (1 for each percentile). Remember, I'm not looking for the percentiles of each X, but for all X's as a whole.
Delete the rows where the variables X1 to X5 contains 1 or 100

My attempt: (based on how to find percentiles, replace outliers with the 5th and 95th percentile, remove data greater than 95th percentile in data frame)

as.data.frame(sapply(select(DF, X1:X5), function (x) {
     qx <- quantile(x, probs = c(1:100)/100)
     cut(x, qx, labels = c(1:100))
}))

But.. my attempt raises the error that the number of breaks is different to the number of labels, I'm struggling to assign the new dataframe without losing A and some_number variables (in my real problem they are not two columns, but nearly 50)

Any suggestions?

Using `sapply` you are calculating quantile for each column separately. If you want to consider all the X variables as whole do you want to remove rows where value is outside this range `quantile(as.matrix(DF[3:7]), c(0.01, 0.99))` ? — Ronak Shah, Jun 09 '21 at 03:08
You have 99 levels `nlevels(cut(DF$X1, quantile(DF$X1, probs = c(1:100)/100)))` and are assigning 100 labels. — thelatemail, Jun 09 '21 at 03:13
@RonakShah yes, that's right. thelatemail do you know how can I create that additional level in order to match the 100 labels? (Because it's 100 percentiles). — Chris, Jun 09 '21 at 03:35

AnilGoyal · Accepted Answer · 2021-06-10T02:30:25.390

Using both across and c_across in dplyr, you may also do this-

Steps explained -

c_across is usually used with row_wise as it creates a complete copy of data subsetted through its inner argument. But I have done it without rowwise() so instead of creating one row it is creating a copy of whole data as desired.
thereafter two quantiles of this data will be deduced. (which will be scalar quantities)
Now only job remains is to to check these values with every other value in data. So I used here across directly.
Using across I built a lambda formula which starts with a twiddle and its argument is . only. This twiddle style formula ~ . is equivalent to function(x) x and the rest is clear.

DF %>% mutate(across(starts_with('X'), ~ifelse(. > quantile(c_across(starts_with('X')), 0.99) |
                                                 . < quantile(c_across(starts_with('X')), 0.01),
                                               NA, .) 
                     )) %>% na.omit()

#>           A some_number X1  X2  X3  X4  X5
#> 6   event_6          69  6 106 206 306 406
#> 7   event_7         871  7 107 207 307 407
#> 8   event_8         356  8 108 208 308 408
.
.
.
#> 93 event_93         432 93 193 293 393 493
#> 94 event_94         967 94 194 294 394 494
#> 95 event_95         516 95 195 295 395 495

Since starts_with works only in across or c_across and to avoid slower rowwise here, we can also do this directly

DF %>% filter(rowSums(cur_data()[str_detect(names(DF), 'X')] > quantile(c_across(starts_with('X')), 0.99)) == 0 &
                rowSums(cur_data()[str_detect(names(DF), 'X')] < quantile(c_across(starts_with('X')), 0.01)) == 0)

This will also give 90 rows in output as desired

Could you explain a little bit what does the first bunch of code? Specifically how mutate actually creates the column without using '=', and the use of ~. as my intuition tell me it's not used as negation. Thanks! — Chris, Jun 10 '21 at 02:13

score 1 · Answer 2 · answered Jun 09 '21 at 03:41

1

You can try the following -

library(dplyr)
vec <- DF %>% select(starts_with('X')) %>% as.matrix() %>% quantile(c(0.01, 0.99))

DF %>% filter(if_all(starts_with('X'), ~. > vec[1] & . < vec[2]))

answered Jun 09 '21 at 03:41

Ronak Shah

377,200
20
156
213

Replace and delete first and last percentile in dataframe or multiple columns at once

2 Answers2