I have this dataset:
A <- paste0("event_", c(1:100))
some_number <- sample.int(1000,size=100)
X1 <- c(1:100)
X2 <- c(101:200)
X3 <- c(201:300)
X4 <- c(301:400)
X5 <- c(401:500)
DF <- data.frame(A, some_number, X1, X2, X3, X4, X5)
As I'm treating outliers, I want to delete the rows that contain values in the 1st and the last (100th) percentile, considering only the X variables for the percentile computation and treating all X variables as ONE group. Hence, the percentiles are computed over X1 to X5 pooled together. These are the steps that occur to me:
- Replace the values of X1 to X5 with 1 to 100 (one value per percentile). Remember, I'm not looking for the percentiles of each X, but for all X's as a whole.
- Delete the rows where the variables X1 to X5 contain 1 or 100.
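A minimal sketch of these two steps in base R, in case it helps pin down what I mean. It makes several assumptions that are mine and not settled: a row is dropped when any of its X values falls in bin 1 or bin 100, the bins come from `quantile()` with its default interpolation, and the names `x_cols`, `pooled`, `breaks`, `bins`, `is_outlier` and `DF_clean` are only illustrative. The seed is added purely for reproducibility.

```r
# Rebuild the example data (seed added only so the sketch is reproducible)
set.seed(1)
DF <- data.frame(
  A = paste0("event_", 1:100),
  some_number = sample.int(1000, size = 100),
  X1 = 1:100, X2 = 101:200, X3 = 201:300, X4 = 301:400, X5 = 401:500
)

x_cols <- paste0("X", 1:5)      # columns that form the single group
pooled <- unlist(DF[x_cols])    # all 500 X values as one vector

# 101 break points (0%, 1%, ..., 100%) delimit exactly 100 percentile bins
breaks <- quantile(pooled, probs = 0:100 / 100)

# Percentile bin (1..100) of every X value, computed against the pooled breaks;
# include.lowest = TRUE keeps the overall minimum inside bin 1
bins <- sapply(DF[x_cols], function(x) {
  cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE)
})

# Assumption: drop a row if ANY of its X values lands in bin 1 or bin 100
is_outlier <- rowSums(bins == 1 | bins == 100) > 0
DF_clean <- DF[!is_outlier, ]   # A, some_number and all other columns are kept
```

Because the filter is a logical row index on the original `DF`, it does not matter how many non-X columns there are. One caveat: with heavily tied data, `quantile()` can return duplicate break points, and `cut()` rejects non-unique breaks, so real data may need the breaks deduplicated first.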
My attempt (based on "how to find percentiles", "replace outliers with the 5th and 95th percentile", and "remove data greater than 95th percentile in data frame"):
library(dplyr)  # provides select()
as.data.frame(sapply(select(DF, X1:X5), function (x) {
qx <- quantile(x, probs = c(1:100)/100)
cut(x, qx, labels = c(1:100))
}))
But my attempt raises an error saying that the number of breaks differs from the number of labels. I'm also struggling to assign the result back to the data frame without losing the A and some_number variables (in my real problem there are not two such columns, but nearly 50).
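My current understanding of the error, as a small sketch (I may be missing something): `probs = c(1:100)/100` yields 100 break points, which only delimit 99 intervals, while I pass 100 labels. Starting at the 0% quantile seems to give the 101 break points needed. The `toy` data frame and the names `qx0`, `binned` and `n_intervals` are made up for illustration.

```r
x <- 1:100

# probs = 1:100/100 gives 100 break points, i.e. only 99 intervals for 100 labels
qx <- quantile(x, probs = 1:100 / 100)
n_intervals <- length(qx) - 1

# Starting at the 0% quantile gives 101 break points and therefore 100 intervals
qx0 <- quantile(x, probs = 0:100 / 100)
binned <- cut(x, breaks = qx0, labels = 1:100, include.lowest = TRUE)

# Assigning back into selected columns leaves every other column untouched
toy <- data.frame(id = letters[1:4], X1 = c(1, 50, 99, 100), X2 = c(2, 3, 4, 5))
x_cols <- c("X1", "X2")
toy[x_cols] <- lapply(toy[x_cols], function(v) {
  cut(v, breaks = qx0, labels = FALSE, include.lowest = TRUE)
})
```

Note that this still bins each value against a single set of breaks (`qx0`); for my actual goal the breaks would have to come from all X columns pooled together, not from one column at a time.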
Any suggestions?