I have a function that is intended to operate on data obtained from a variety of sources with many manual entry fields. Since I don't know what layout or naming convention to expect in these files, I want it to 'scan' a data frame for a column whose name contains the character string 'fix', 'name', or 'agent', copy that column into a new column named 'Firm', do string cleaning on the entries of the new column, and finally remove the original column. I have gotten it to work with SOME of the CSVs that I have already, but have now run into this error: ONLY STRINGS CAN BE CONVERTED TO SYMBOLS. I have looked at the thread "ERROR: Only strings can be converted to symbols", but to no avail.

Here is the function at the moment:

clean_firm_names2 <- function(df){
  df <- df %>%
    mutate(Firm := !!rlang::sym(grep(pattern = '(AGENT)|(NAME)|(FIX)',x = colnames(.), ignore.case = T, value = T)) %>% 
             str_replace_all(pattern = "(\\W)+"," ") %>% 
             ...str manipulations...
             str_squish()) %>%
    dplyr::select(-(!!rlang::sym(grep(pattern = '(AGENT)|(NAME)|(FIX)',x = colnames(.), ignore.case = T, value = T))))
  return(df)
}

I have tried using as.character() around the grep() function but that did not solve the problem. I have looked at the CSV that the function is meant to operate on and all of the column names are character strings. I read in the CSV using vroom(), as with my other CSVs, and that works fine, all of the column names appear. I can perform other dplyr functions on the df, suggesting to me that the df is behaving normally otherwise. I have run out of ideas as to why the function is choking up only on SOME of my CSVs but works as intended on others. Has anyone run into similar issues or got any clues as to what might be causing this error? This is the first time I've used SO-- I'm sorry if this question isn't very clear. I'll try and edit as needed.

Thanks!

SGE
  • To reiterate for others who may face this issue: as @Artem Sokolov suggests below, this code fails in cases where a vector of strings is returned, rather than a single string (a result, in this case, of the flexible regex). – SGE Aug 12 '20 at 22:23
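
To make the comment above concrete, here is a minimal sketch of the failure. The column names are hypothetical, chosen so that the loose pattern matches two of them; the exact wording of the error message may differ between rlang versions.

```r
# Hypothetical column names: "preFIX" is a decoy that also contains "FIX"
cols <- c("id", "preFIX", "Agent_Name")

matches <- grep("(AGENT)|(NAME)|(FIX)", cols, ignore.case = TRUE, value = TRUE)
length(matches)  # more than one match with these names

# rlang::sym() accepts a single string only, so a longer character
# vector raises an error. The message text depends on the rlang version.
result <- tryCatch(
  rlang::sym(matches),
  error = function(e) conditionMessage(e)
)

# With exactly one string, sym() succeeds and returns a symbol
single <- rlang::sym(matches[2])
```

Checking `length(matches)` on the CSVs that fail is a quick way to confirm whether this is what is happening.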

1 Answer

Note that grep() returns the indices of the matches (integers) by default; it returns the matching strings only when called with value = TRUE. Integer indices can be passed directly to dplyr::rename, so perhaps the following may work better?

i <- grep(pattern = '(AGENT)|(NAME)|(FIX)', x = colnames(df), ignore.case = TRUE)
df <- df %>%
  rename(Firm = all_of(i)) %>%  # all_of() marks i as an external vector, not a column named i
  mutate(Firm = ...str manipulations... )

(There is an implicit assumption here that your grep() returns a single index. Additional code may be required to handle multiple matches.)
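
One possible way to make that assumption explicit is to check the number of matches before renaming. This is only a sketch: `find_firm_column()` is a hypothetical helper, not part of dplyr or rlang, and the column names are made up.

```r
# Hypothetical helper: return the single matching column name,
# or stop with a message listing what was matched instead.
find_firm_column <- function(col_names, pattern = "(AGENT)|(NAME)|(FIX)") {
  hits <- grep(pattern, col_names, ignore.case = TRUE, value = TRUE)
  if (length(hits) != 1) {
    stop(
      "Expected exactly one matching column, found ", length(hits), ": ",
      paste(hits, collapse = ", ")
    )
  }
  hits
}

# Exactly one match: the name is returned
ok <- find_firm_column(c("id", "Agent_Name", "state"))

# Two matches: the helper stops, and the message names both columns
bad <- tryCatch(
  find_firm_column(c("preFIX", "Agent_Name")),
  error = function(e) conditionMessage(e)
)
```

Failing early like this reports the offending column names directly, which may be easier to debug than an error raised later inside `mutate()` or `rename()`.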

Artem Sokolov
  • Thanks for the explanation as to why-- I tried this and also tried entering the grep command directly into `rename(Firm = grep(...))` for concision but couldn't get either to work. In both cases I get the following error: `Problem with 'mutate()'; input 'Firm'. object 'Firm' not found; Input is ''%>%'(...)'.` Furthermore, for some reason the previous code will work if I select only one of the possible match words, like "(NAME)" or "(FIX)" but fails with the option of either using ` | `. Could grep() be choking on this regex? – SGE Aug 12 '20 at 21:37
  • @SGE Two questions: 1) Does `grep()` return a single index when you give its output to `rename()`; 2) Does rename correctly output a dataframe where your agent/name/fix column is renamed to Firm? – Artem Sokolov Aug 12 '20 at 21:50
  • Oh boy. you are correct-- double checking Q1, I discovered that it was matching against another column (preFIX) in addition to the target column and that was causing the error. 2 executes successfully when I address the double match in 1. I am now curious how to make it act on each of the matching indices. Anyways, what's protocol here? If you have any suggestions regarding this I'd certainly appreciate the input! Though I suppose that's a different question altogether. Thanks, again. – SGE Aug 12 '20 at 22:14
  • @SGE If you want to avoid the matching of prefixes, consider using `^` to [designate the beginning of the string](https://stackoverflow.com/questions/38331632/how-to-match-the-start-and-end-of-an-expression-with-grep-in-r). – Artem Sokolov Aug 12 '20 at 23:28
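
The last two comments can be sketched together: anchoring the pattern with `^` to skip the `preFIX` column, and, alternatively, cleaning every matched column with `across()`. The data frame is hypothetical, and the two cleaning steps stand in for the elided string manipulations in the question. This assumes dplyr 1.0.0 or later, where `across()` is available.

```r
library(dplyr)
library(stringr)

# Hypothetical data: one target column plus a decoy ("preFIX")
df <- tibble(
  preFIX     = c("Mr.", "Ms."),
  Agent_Name = c("  Acme,  Inc. ", "Smith & Co"),
  zip        = c("10001", "94105")
)

# Unanchored: matches anywhere in the name, so "preFIX" is picked up too
loose <- grep("(AGENT)|(NAME)|(FIX)", colnames(df),
              ignore.case = TRUE, value = TRUE)

# Anchored with ^: only names that START with one of the keywords
anchored <- grep("^(AGENT|NAME|FIX)", colnames(df),
                 ignore.case = TRUE, value = TRUE)

# To act on every matched column instead, apply the cleaning via across()
cleaned <- df %>%
  mutate(across(
    all_of(loose),
    ~ str_squish(str_replace_all(.x, "\\W+", " "))
  ))
```

Note that anchoring is a trade-off: `^` would also skip a column such as `firm_name`, where the keyword is not at the start, so it only helps if the target columns reliably begin with one of the keywords.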