0

I would like to do this in R

My dataframe looks like this:

  lab  value1  value2
   wbc  7.0    6
   wbc  6.5    3
   rbc  3.5    2
   rbc  4.0    2
   plt  100    1
   plt  120    2
   wbc  5.0    2
   wbc  7.5    1
   rbc  4.1    0

I would like use the ecdf function to calculate quantiles for value1 and value2 grouped by lab

Final output:

lab  value1  value2 quantile1 quantile2
 wbc  7.0    6
 wbc  6.5    3
 rbc  3.5    2
 rbc  4.0    2
 plt  100    1
 plt  120    2
 wbc  5.0    2
 wbc  7.5    1
 rbc  4.1    0
d.b
  • 32,245
  • 6
  • 36
  • 77
teotjunk
  • 11
  • 1

1 Answers1

1

base R

cols <- c("value1", "value2")
dat[,paste0(cols, "_q")] <- ave(dat[,cols], dat$lab, FUN=function(z) lapply(z, function(y) ecdf(y)(y)))
dat
#   lab value1 value2  value1_q  value2_q
# 1 wbc    7.0      6 0.7500000 1.0000000
# 2 wbc    6.5      3 0.5000000 0.7500000
# 3 rbc    3.5      2 0.3333333 1.0000000
# 4 rbc    4.0      2 0.6666667 1.0000000
# 5 plt  100.0      1 0.5000000 0.5000000
# 6 plt  120.0      2 1.0000000 1.0000000
# 7 wbc    5.0      2 0.2500000 0.5000000
# 8 wbc    7.5      1 1.0000000 0.2500000
# 9 rbc    4.1      0 1.0000000 0.3333333

dplyr

library(dplyr)
dat %>%
  group_by(lab) %>%
  mutate_at(vars(value1, value2), list(quant = ~ ecdf(.)(.))) %>%
  ungroup()
# # A tibble: 9 x 5
#   lab   value1 value2 value1_quant value2_quant
#   <chr>  <dbl>  <int>        <dbl>        <dbl>
# 1 wbc      7        6        0.75         1    
# 2 wbc      6.5      3        0.5          0.75 
# 3 rbc      3.5      2        0.333        1    
# 4 rbc      4        2        0.667        1    
# 5 plt    100        1        0.5          0.5  
# 6 plt    120        2        1            1    
# 7 wbc      5        2        0.25         0.5  
# 8 wbc      7.5      1        1            0.25 
# 9 rbc      4.1      0        1            0.333

data.table

library(data.table)
cols <- c("value1", "value2")
datDT <- as.data.table(dat)
datDT[, (paste0(cols,"_q")) := lapply(.SD, function(z) ecdf(z)(z)), .SDcols = cols, by = .(lab) ]
datDT
#    lab value1 value2  value1_q  value2_q
# 1: wbc    7.0      6 0.7500000 1.0000000
# 2: wbc    6.5      3 0.5000000 0.7500000
# 3: rbc    3.5      2 0.3333333 1.0000000
# 4: rbc    4.0      2 0.6666667 1.0000000
# 5: plt  100.0      1 0.5000000 0.5000000
# 6: plt  120.0      2 1.0000000 1.0000000
# 7: wbc    5.0      2 0.2500000 0.5000000
# 8: wbc    7.5      1 1.0000000 0.2500000
# 9: rbc    4.1      0 1.0000000 0.3333333
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • wonder if `mutate_at(vars(value1, value2), list(quant = ~ rank(., ties.method = "max")/n()))` will be faster – d.b Dec 29 '20 at 05:44
  • In some rows of my dataframe I have all nas for values and this generates an error message. Anyway around it? – teotjunk Dec 29 '20 at 08:10
  • Now suppose I want to apply the function 0.6745*(val-median)/mae intead of ecdf . How do I amend the code? – teotjunk Dec 29 '20 at 08:31
  • Replace the `ecdf(z)(z)` with whatever you want to do with `z` (the values within each `lab` group. – r2evans Dec 29 '20 at 12:44
  • If you have rows with all `NA`, then use `function(z) if (any(!is.na(z))) ecdf(z)(z) else NA_real_` (or, perhaps a little more sloppily, `function(z) tryCatch(ecdf(z)(z), error=function(e) NA_real_)`. – r2evans Dec 29 '20 at 12:45