0

I would like to combine cut with group_by, but it is not working out. I tried to follow the recommendations of this thread Using cut() with group_by() but still it did not work.

Here is a reproducible code:

library(dplyr)
set.seed(1)
df <- tibble(
  V1 = round(runif(1000,min=1, max=1000)),
  V2 = round(runif(1000, min=1, max=3)),
  V3 = round(runif(1000, min=1, max=10)))

df$V2 = as.factor(df$V2)
df$V3 = as.factor(df$V3)


df$split= cut(df$V1, quantile(df$V1, c(0, .2, .6, 1)), include.lowest = TRUE)

Here is how I successfully combined group_by with ntile function.

df=df %>% 
  group_by(V2,V3) %>%
  mutate(quartile_by_group = ntile(V1,4))

But that does not work when I combine it with cut. We can see clearly that we have dozens instead of only three categories.

df=df %>%    
  group_by(V2, V3) %>%    
  mutate(split_by_group = cut(V1, quantile(V1, c(0, .2, .6, 1)), include.lowest = TRUE))

table(df$split_by_group)
Jack
  • 813
  • 4
  • 17
  • 1
    This the expected output. You are doing quantiles by the interactions of groups of V2 and V3, so you should expect 3*10*3 = 90 groups – Maël Feb 15 '23 at 10:55
  • Why would you expect only three categories? – Maël Feb 15 '23 at 10:56
  • `quantile(V1, c(0, .2, .6, 1)` yields different cut points for each group, so this is the reason you get so many different values. What is your expected output? – TimTeaFan Feb 15 '23 at 10:58
  • V1 represent wage. V2 and V3 represent countries and time. When i use the quartile function, it split my sample by four quartile for each country and period separately. Thus i end up with four categories at the end each representing a different quartile. I would like to do the same but for three categories that are split from 0 to 20 percentile, 21 to 60, and 61 to 100. – Jack Feb 15 '23 at 11:00

1 Answers1

2

Use labels in cut to get the 3 categories, irrespective of the group:

df <- df %>%    
  group_by(V2, V3) %>%    
  mutate(split_by_group = cut(V1, quantile(V1, c(0, .2, .6, 1)), labels = 1:3, include.lowest = TRUE))

table(df$split_by_group)
#  1   2   3 
#213 388 399 
Maël
  • 45,206
  • 3
  • 29
  • 67