How to select observations that are within a certain quantile

Question

I have data (~1000 rows) that look like this:

head(data)

   alt   alb   alp alt_zscore alb_zscore alp_zscore
 <dbl> <dbl> <dbl>      <dbl>      <dbl>      <dbl>
1  11    2.60   9       -1.54      -7.82      -0.949
2  12    5.37  86.3     -1.45      -0.351      2.31 
3  15.7  4.67  28       -1.09      -2.24      -0.148
4   7    4.43 171.      -1.93      -2.89       5.87 
5  14.5  3.75  12       -1.20      -4.72      -0.822
6  17.5  3.70  82.5     -0.915     -4.86       2.15

Each variable column (i.e., alt, alb, alp) has a corresponding z-score column (alt_zscore, alb_zscore, alp_zscore).

In my previous code, I told R, for each z-scored column, if an observation is further than 1 standard deviation lower than the mean, to take the absolute value of the z-scored observation; if it does not meet the criteria, give it a 0. (The reason I do this is because later on in my code, I add up all those z-scored observations into another column.)

Here is my previous code:

name <- c("alt_zscore", "alb_zscore", "alp_zscore")

stdev <- 1

lf <- list(
  \(x) ifelse(x <= -stdev, abs(x), 0),
  \(x) ifelse(x <= -stdev, abs(x), 0),
  \(x) ifelse(x <= -stdev, abs(x), 0)
) %>% 
  setNames(name)

The reason I do this is so I can create a new column "total_score" that is the sum of all the z-scores that fit my criteria.

data <- data %>% 
  mutate(total_score = rowSums(across(all_of(name), ~ lf[[cur_column()]](.)), na.rm = TRUE))

Now, what I am trying to do is tell R, for each 'regular' column (i.e., here I'm referring to 'alt', not 'alt_zscore'), if an observation is less than the 25th percentile of that column, then take the absolute value of its corresponding z-scored column (alt_zscore); otherwise, give it a zero. Note: sometimes I will need to specify the 75th percentile, or the 25th OR the 75th, so I’m hoping for the code to be adjustable in that regard.

I'm trying to modify my existing code to perform this but haven't had luck. Any help would be greatly appreciated. Thank you!

Darren Tsai · Accepted Answer · 2023-07-29T17:33:16.283

4

You can use across() with cur_column() to search the corresponding z-score column.

library(dplyr)

df %>%
  mutate(across(alt:alp,
                ~ if_else(.x < quantile(.x, .25), abs(get(paste0(cur_column(), "_zscore"))), 0),
                .names = "{.col}_new"))

#    alt  alb   alp alt_zscore alb_zscore alp_zscore alt_new alb_new alp_new
# 1 11.0 2.60   9.0     -1.540     -7.820     -0.949    1.54    7.82   0.949
# 2 12.0 5.37  86.3     -1.450     -0.351      2.310    0.00    0.00   0.000
# 3 15.7 4.67  28.0     -1.090     -2.240     -0.148    0.00    0.00   0.000
# 4  7.0 4.43 171.0     -1.930     -2.890      5.870    1.93    0.00   0.000
# 5 14.5 3.75  12.0     -1.200     -4.720     -0.822    0.00    0.00   0.822
# 6 17.5 3.70  82.5     -0.915     -4.860      2.150    0.00    4.86   0.000

For your first task, the list of functions (lf) is not needed because all functions are the same.

df %>% 
  mutate(total_score = rowSums(
    across(alt_zscore:alp_zscore, ~ ifelse(.x <= -1, abs(.x), 0)), na.rm = TRUE
  ))

Data

df <- read.table(text =
"   alt   alb   alp alt_zscore alb_zscore alp_zscore
1  11    2.60   9       -1.54      -7.82      -0.949
2  12    5.37  86.3     -1.45      -0.351      2.31 
3  15.7  4.67  28       -1.09      -2.24      -0.148
4   7    4.43 171.      -1.93      -2.89       5.87 
5  14.5  3.75  12       -1.20      -4.72      -0.822
6  17.5  3.70  82.5     -0.915     -4.86       2.15")

edited Jul 29 '23 at 17:33

answered Jul 28 '23 at 05:38

Darren Tsai

32,117
5
21
51

1

Great answer! I have also wondered about this – stats_noob Jul 28 '23 at 05:40
What if the functions are not the same? IE, you have a mix of some rows where the function is <25th, others where it’s >75th, or others where it’s <25th OR >75th. Sorry - should have clarified that in the OG post. I’ll edit it. – burphound Jul 28 '23 at 06:04
@burphound If the criterion is the same for `alt`, `alb`, and `alp`, you can just modify the `if_else()` part inside `across()`. E.g. `~ if_else(.x < quantile(.x, .25) | .x > quantile(.x, .75), abs(get(paste0(cur_column(), "_zscore"))), 0)` – Darren Tsai Jul 28 '23 at 06:30
@DarrenTsai thank you for checking in! I am currently working through it but I anticipate my issue will be solved. I didn't provide this in my example df to keep it simple (maybe I should have) but I have other columns in between alt, alb, alp, etc that I won't be using in these calculations, so instead of using the across function I might have to just use an individual mutate argument for each column I want to use. Thank you again for your help! – burphound Jul 28 '23 at 17:39
@burphound [Tidy-selection](https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html) provides many convenient syntaxes for selecting variables based on their names or properties. E.g. `alt:alp` means a range of consecutive columns between `alt` and `alp`. You can replace `across(alt:alp, ...)` with `across(c(var1, var2, etc), ...)` to specify which columns you want to use in your calculations. – Darren Tsai Jul 28 '23 at 17:50
@DarrenTsai Thank you so much! I got that working. Could you please explain why, in your last line of code in your posted answer, after ''' rowSums(across(alt:alp, ''' you put ''' ~ ifelse(.x <= -1, abs(.x), 0)) ''' ? – burphound Jul 28 '23 at 23:43
@DarrenTsai I think I mean, why choose if x<-1? For this part of the code, I want to add all the " _zscore" values. – burphound Jul 29 '23 at 16:41
@burphound oh that's typo! I fixed it. It should be `across(alt_zscore:alp_zscore, ...)` not `across(alt:alp, ...)`. – Darren Tsai Jul 29 '23 at 17:26

How to select observations that are within a certain quantile

1 Answers1

Data