
I've been struggling to understand tidyeval and the use of quo, quos, sym, !!, !!! and the like. I made some attempts, but couldn't generalize my code so that it accepts a vector of columns and applies text processing to those columns of a dataframe. My dataframe looks like this:

ocupation      tasks               id
Sink Cleaner   Cleaning the sink   1
Lion petter    Pet the lions       2

And my code looks like this:

library(tidyverse)  # dplyr (mutate, %>%) and stringr (str_remove_all, str_squish)
library(glue)

stopwords_regex = paste(tm::stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = glue('\\b{stopwords_regex}\\b')


df = df %>% mutate(ocupation_proc = ocupation %>% tolower() %>% 
                     stringi::stri_trans_general("Latin-ASCII") %>% 
                     str_remove_all(stopwords_regex) %>% 
                     str_remove_all("[[:punct:]]") %>%  
                     str_squish(),
                   tasks_proc = tasks %>% tolower() %>% 
                     stringi::stri_trans_general("Latin-ASCII") %>% 
                     str_remove_all(stopwords_regex) %>%
                     str_remove_all("[[:punct:]]") %>% 
                     str_squish()) 

Which brings something like this:

ocupation      tasks               id   ocupation_proc   tasks_proc
Sink Cleaner   Cleaning the sink   1    sink cleaner     cleaning sink
Lion petter    Pet the lions       2    lion petter      pet lions

I'd like to turn this into a function process_text_columns(df, columns_list, new_col_names), where in this case df=df, columns_list=c('ocupation', 'tasks') and new_col_names=c('ocupation_proc', 'tasks_proc'). (new_col_names might not even be necessary if I can do something like glue({colname}_proc) to name the new columns.) From what I've gathered, I'd need to use across, sym, quos and maybe !!! to generalize the function, but everything I've tried has failed. Do you have any ideas?

Thanks

Juan C
    They should be enough to make a reproducible example. The only thing I need is a function that can take n column names as arguments and processes text for those, so later I may directly use it in another dataframe – Juan C Aug 18 '21 at 22:29

1 Answer


Does this work for you as expected? The "curly curly" operator ({{ }}), introduced in rlang 0.4.0 in June 2019, simplifies things by combining "quote-and-unquote into a single interpolation step."

clean_steps <- function(a_column) {
  a_column %>%
    tolower() %>% 
    stringi::stri_trans_general("Latin-ASCII") %>% 
    str_remove_all(stopwords_regex) %>%
    str_remove_all("[[:punct:]]") %>% 
    str_squish()
}

my_great_function <- function(df, columns_list, new_col_names) {
  mutate(df, across( {{columns_list}}, ~clean_steps(.x))) %>%
    rename( !!new_col_names )
}

my_great_function(df, 
                  c(ocupation, tasks), 
                  c(ocu = "ocupation", tas = "tasks"))

Output

           ocu           tas id
1 sink cleaner cleaning sink  1
2  lion petter     pet lions  2

EDIT: To keep the unprocessed columns and add the processed ones under new names, the easiest approach is the .names argument of across:

# new_col_names is no longer needed: .names builds each new name from the source column
my_great_function <- function(df, columns_list) {
  mutate(df, across( {{columns_list}}, ~clean_steps(.x), .names = "{.col}_proc"))
}

my_great_function(df, c(ocupation, tasks))


     ocupation             tasks id ocupation_proc    tasks_proc
1 Sink Cleaner Cleaning the sink  1   sink cleaner cleaning sink
2  Lion petter     Pet the lions  2    lion petter     pet lions
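
The question asked for column names given as strings, e.g. columns_list=c('ocupation', 'tasks'). In that case {{ }} is not needed; a character vector can be handed to across through all_of(). The following is only a rough sketch under a few assumptions: clean_steps_simple() is a hypothetical, cut-down stand-in for clean_steps() (no stop-word removal or transliteration, so it runs without tm or stringi), and the suffix argument is made up for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for clean_steps(): lowercases, strips punctuation
# and squishes whitespace only.
clean_steps_simple <- function(a_column) {
  a_column %>%
    tolower() %>%
    str_remove_all("[[:punct:]]") %>%
    str_squish()
}

# Hypothetical variant: columns_list is a character vector of column names,
# and suffix (an assumed argument) is appended to build the new names.
process_text_columns <- function(df, columns_list, suffix = "_proc") {
  mutate(df, across(all_of(columns_list), clean_steps_simple,
                    .names = paste0("{.col}", suffix)))
}

df <- data.frame(
  ocupation = c("Sink Cleaner", "Lion petter"),
  tasks = c("Cleaning the sink", "Pet the lions"),
  id = 1:2
)

result <- process_text_columns(df, c("ocupation", "tasks"))
names(result)
```

all_of() also raises an error if one of the names is missing from the dataframe, which is usually what you want when the same function is reused on other dataframes.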
Jon Spring
  • That works like a charm, thank you very much! I have many questions, but first of all I wanted to ask you if it's easily possible to create these columns as new columns with different names using this code? I guess something in `mutate` can be done to do so – Juan C Aug 18 '21 at 22:37
  • updated with approach from here: https://stackoverflow.com/questions/52482185/tidy-evaluation-when-renaming-columns-in-dplyr – Jon Spring Aug 18 '21 at 22:46
  • Thanks again! One of my other questions was about "curly curly" which is a great tool to know about, thanks for providing a source too. My last question is about `~clean_steps(.x)`, what does the tilde operator mean in this case? Also, I assume `.x` tells R to apply the function on each element of the list, right? – Juan C Aug 18 '21 at 22:48
  • Check out the help for `?dplyr::across` -- it specifies a few ways to refer to the functions you want to use. The tilde is borrowing syntax from another `tidyverse` package, `purrr`, where as you said the `.x` is the placeholder for the input to the function. – Jon Spring Aug 18 '21 at 23:23
  • I noticed the problem was the desired output I posted, it was wrong. I wanted to create new columns and preserve the old ones, which is what I edited in now. As of now I could do that by pre-creating a copy of the columns before the function runs and then applying the code which would give me the modified version, but this doesn't seem very straight-forward. Do you have any ideas to better achieve this same thing? Thanks ! – Juan C Aug 19 '21 at 01:04
  • See edit: `across` has a `.names` argument where we can manipulate the column names, e.g. by adding a suffix `_proc` – Jon Spring Aug 19 '21 at 03:05
  • Thanks for all the help Jon, very much appreciated! – Juan C Aug 19 '21 at 18:55
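
On the tilde question from the comments: a minimal sketch, using a made-up two-column dataframe, of three ways to hand a function to across. One detail worth noting is that .x stands for the whole current column (a vector), not for each element in turn; the cleaning works element by element only because tolower and the stringr functions are vectorised.

```r
library(dplyr)

# Toy data, invented for illustration only.
df <- data.frame(a = c("X", "Y"), b = c("P", "Q"))

# 1. Pass the function by name.
by_name <- mutate(df, across(c(a, b), tolower))

# 2. purrr-style lambda: ~ starts the formula, .x is the current column.
by_lambda <- mutate(df, across(c(a, b), ~ tolower(.x)))

# 3. Ordinary anonymous function.
by_anon <- mutate(df, across(c(a, b), function(col) tolower(col)))

# All three give the same result.
identical(by_name, by_lambda) && identical(by_lambda, by_anon)
```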