I have a messy factor variable, which contains various very similar factor levels (e.g. introduced by spelling mistakes, slightly different wordings etc.). I'm trying to combine the factor into four main categories using the fct_collapse function from the forcats package.

However, given the amount of variability, I wanted to combine fct_collapse with the selection helpers from tidyselect, such as starts_with() and contains().

Here is a simple reproducible example: a single factor column with different levels that I would like to reduce to two factor levels, "a" and "b".

factor_df <- tibble(my_factor = factor(c("a_1", "a_2", "a_x", "a_factor", "a_factor", "also_factor_a",
                                         "1_b_1", "2_b_2", "xx_b_xx")))

Instead of listing every single factor level, I would like to use the selection helpers to do that for me where possible. However, the following code throws an error:

factor_df%>%
            mutate(new_fct=fct_collapse(factor_df$my_factor,
                                        a=c(starts_with("a_"), "also_factor_a"),
                                        b=c(tidyselect::contains("_b_"))))

Error: starts_with() must be used within a selecting function.
ℹ See https://tidyselect.r-lib.org/reference/faq-selection-context.html.

(The link is not overly helpful.) How can this be done, using the helper functions?

Rasul89
  • The select helper functions only work for selecting columns. Use the stringr function for general string matching. – MrFlick Sep 29 '21 at 19:04

1 Answer

starts_with() is a tidyselect helper (re-exported by dplyr). It matches column names, not the values in a column, so it cannot be used on factor levels. We may instead apply either grep() or startsWith() to levels(my_factor):

library(dplyr)
library(forcats)

factor_df %>%
  mutate(new_fct = fct_collapse(my_factor,
    a = c(levels(my_factor)[startsWith(levels(my_factor), "a_")],
          "also_factor_a"),
    b = grep("_b_", levels(my_factor), value = TRUE)))

-output

# A tibble: 9 × 2
  my_factor     new_fct
  <fct>         <fct>  
1 a_1           a      
2 a_2           a      
3 a_x           a      
4 a_factor      a      
5 a_factor      a      
6 also_factor_a a      
7 1_b_1         b      
8 2_b_2         b      
9 xx_b_xx       b   
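
As the comment on the question suggests, stringr can do the same matching. The following is only a minimal sketch, assuming stringr is installed; the names lvls and result are introduced here for illustration and are not part of the original code.

```r
library(dplyr)
library(forcats)
library(stringr)

# Same example data as in the question
factor_df <- tibble(my_factor = factor(c("a_1", "a_2", "a_x", "a_factor", "a_factor",
                                         "also_factor_a", "1_b_1", "2_b_2", "xx_b_xx")))

# Match against the factor levels, not against column names
lvls <- levels(factor_df$my_factor)

result <- factor_df %>%
  mutate(new_fct = fct_collapse(my_factor,
    # "^a_" anchors the match at the start, like starts_with("a_")
    a = c(str_subset(lvls, "^a_"), "also_factor_a"),
    # fixed() matches the literal substring, like contains("_b_")
    b = str_subset(lvls, fixed("_b_"))))
```

str_subset() returns the matching levels as a character vector, which is the form fct_collapse expects for each new level.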
akrun