3

I have a dataframe that is nested by groups. I want to convert variable 'x' from its raw value to quantile position (20%, 40%, 60%, 80%, 100% or 1, 2, 3, 4, 5).

Here is an example of the data I'm using:

df <- data.frame(x=c(1, 5, 21, 24, 43, 47, 56, 59, 68, 69, 11, 15, 25, 27, 48, 49, 51, 55, 61, 67),
                 y=c("A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B"))

This is what I have tried:

df$z <- aggregate(df$x, by = list(df$y), FUN = function(x) quantile(x, probs = c(0.2, 0.4, 0.6, 0.8, 1), na.rm = T))

In essence, I would like to create a new variable that looks like this:

df$z <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
Marco Pastor Mayo
  • 803
  • 11
  • 25

2 Answers2

2

On a grouped data.frame you can use dplyr::ntile():

library(dplyr)

df %>%
  group_by(y) %>%
  mutate(z = ntile(x, 5))

# A tibble: 20 x 3
# Groups:   y [2]
       x y         z
   <dbl> <fct> <int>
 1     1 A         1
 2     5 A         1
 3    21 A         2
 4    24 A         2
 5    43 A         3
 6    47 A         3
 7    56 A         4
 8    59 A         4
 9    68 A         5
10    69 A         5
11    11 B         1
12    15 B         1
13    25 B         2
14    27 B         2
15    48 B         3
16    49 B         3
17    51 B         4
18    55 B         4
19    61 B         5
20    67 B         5
Ritchie Sacramento
  • 29,890
  • 4
  • 48
  • 56
  • Great answer, thanks. Quick question, how can I extract the `z` column as a new column in the `df`? – Marco Pastor Mayo Nov 28 '19 at 21:56
  • 1
    @MarcoPastorMayo - As part of the pipe, you can do: `... %>% pull(z)`. – Ritchie Sacramento Nov 28 '19 at 21:59
  • Another comment. The `seq_along()` function is unnecessary for what I needed to do (it actually made it not work). `ntile()` is enough to get the string that matches the observations. – Marco Pastor Mayo Nov 28 '19 at 23:16
  • @MarcoPastorMayo - right, I misunderstood as you mentioned position and the vector was already sorted. Will edit. – Ritchie Sacramento Nov 28 '19 at 23:22
  • I have just realised that this answer actually gave me the quantiles for the whole vector, rather than the group quantiles. Suppose the data was `df<-data.frame(x=c(1:100), y=c(rep("A", 50), rep("B", 50)`. In this case it would be more clear that the function gives the quantiles in relation to the whole rather than for the group. I'd really appreciate the solution to this, thanks! – Marco Pastor Mayo Dec 18 '19 at 11:03
  • @MarcoPastorMayo - If the data has been grouped, then the result will be by group. If you've gotten a result for the vector as a whole, then you haven't grouped the data. – Ritchie Sacramento Dec 18 '19 at 20:53
  • Thanks. Turns out the `dplyr` package was causing the `group_by` function to not work. Once I stopped using it the function worked perfectly. – Marco Pastor Mayo Dec 20 '19 at 00:06
1

We can use cut with breaks as the quantile

library(dplyr)  
df %>%
   group_by(y) %>%
   mutate(z = as.integer(cut(x, breaks = c(-Inf, 
       quantile(x, probs = c(0.2, 0.4, 0.6, 0.8, 1), na.rm = TRUE)))))
# A tibble: 20 x 3
# Groups:   y [2]
#       x y         z
#   <dbl> <fct> <int>
# 1     1 A         1
# 2     5 A         1
# 3    21 A         2
# 4    24 A         2
# 5    43 A         3
# 6    47 A         3
# 7    56 A         4
# 8    59 A         4
# 9    68 A         5
#10    69 A         5
#11    11 B         1
#12    15 B         1
#13    25 B         2
#14    27 B         2
#15    48 B         3
#16    49 B         3
#17    51 B         4
#18    55 B         4
#19    61 B         5
#20    67 B         5

Or using base R with ave

with(df, ave(x, y, FUN = function(u) as.integer(cut(u, breaks = c(-Inf,
          quantile(u, probs = c(0.2, 0.4, 0.6, 0.8, 1), na.rm = TRUE))))))
#[1] 1 1 2 2 3 3 4 4 5 5 1 1 2 2 3 3 4 4 5 5

NOTE: Answering based on the quantile question OP asked

akrun
  • 874,273
  • 37
  • 540
  • 662