NOTE: I am new to R and appreciate any and all help. We all start somewhere!

I want to perform the following operation on a project aimed at increasing my R skills:

preferences[, (cols) := gsub(cols, pattern = "UN1", replacement = "A")]

cols is a vector containing only one entry: the name of the column called "Pref_1". My objective is for R to replace all occurrences of "UN1" in the column "Pref_1" with the letter "A". Running the above code does not work, and instead pastes "Pref_1" in every single entry:

    Pref_1
 1: Pref_1
 2: Pref_1
 3: Pref_1
 4: Pref_1
 5: Pref_1
 6: Pref_1
 7: Pref_1
 8: Pref_1
 9: Pref_1
10: Pref_1

Through experimentation, I discovered that if I run otherwise identical code but replace cols (which contains the string "Pref_1") with the bare column name Pref_1 inside gsub(), the code executes as intended:

preferences[, (cols) := gsub(Pref_1, pattern = "UN1", replacement = "A")]
                                                                               Pref_1
 1:                                                                                 A
 2:                                           Food and Agriculture Organization (FAO)
 3:         United Nations Educational, Scientific and Cultural Organization (UNESCO)
 4:                                       United Nations Development Programme (UNDP)
 5:                                                Commission on Narcotic Drugs (CND)
 6:                                                Commission on Narcotic Drugs (CND)
 7:                                                        Human Rights Council (HRC)
 8:                                                                                 A
 9:                                                        Human Rights Council (HRC)
10:                                                                                 A

Why won't gsub() let me reference the column I want to operate on through the cols object? I am insistent on using cols because I want to add more column names to it and then loop over the specified columns. Without a cols vector to loop over, I would have to duplicate this code n times to operate on n columns.

r2evans
AaronSzcz
  • A good place to start is the manual page for `gsub` (`help(gsub)` or `?gsub`). The first argument is the pattern, not the vector (which is referenced as `x=`). If you want a more complete answer, provide reproducible data using `dput()` and code showing your use of that data. – dcarlson Feb 02 '22 at 05:12
  • @dcarlson - naming the arguments as OP used will allow the order to work - `gsub("aaabbbccc", pattern="a", replacement="b")` – thelatemail Feb 02 '22 at 05:29
  • The issue is that `cols` isn't being treated as a reference to a column, but as the text inside `cols` - I think there's a duplicate on this but you can do something like `gsub(get(cols), ...` to get it to work. – thelatemail Feb 02 '22 at 05:38
  • If `cols` is plural, wouldn't it be `lapply(mget(cols), gsub, pattern="UN1", replacement="A")`? – r2evans Feb 02 '22 at 05:40
  • or more canonically, `preferences[, (cols) := lapply(.SD, gsub, pattern="UN1", replacement="A"), .SDcols=cols]`. – r2evans Feb 02 '22 at 05:51
  • @r2evans - indeed, the `.SDcols` option in particular is probably the way to go. – thelatemail Feb 02 '22 at 05:52
  • FYI @AaronSzcz, [tag:data.table] != [tag:datatable], I made the switch for you. Click on the tag names here to see what the difference is. – r2evans Feb 02 '22 at 05:53

1 Answer

gsub is being given a vector of strings, and it does what it knows: works on the strings. It doesn't know that they should be an indirect reference. (Nothing will know that it should be indirect.)
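
To see why every row ends up as "Pref_1", it may help to run the gsub call on its own, outside of data.table. This is only a minimal sketch of what happens inside j; the name res is introduced here for illustration.

```r
cols <- "Pref_1"

# gsub() sees only the character vector "Pref_1", not the column of that name.
# "UN1" does not occur in the string "Pref_1", so it comes back unchanged.
res <- gsub(cols, pattern = "UN1", replacement = "A")
res
# [1] "Pref_1"
```

The `:=` assignment then recycles that length-1 result down the whole column, which matches the output shown in the question.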

You have two options:

  1. The canonical way in data.table for this is likely to use .SDcols.

    preferences[, (cols) := lapply(.SD, gsub, pattern = "UN1", replacement = "A"), .SDcols = cols]
    preferences
    #                                      Pref_1
    #                                      <char>
    #  1:                                       A
    #  2: Food and Agriculture Organization (F...
    #  3: United Nations Educational, Scientif...
    #  4: United Nations Development Programme...
    #  5:      Commission on Narcotic Drugs (CND)
    #  6:      Commission on Narcotic Drugs (CND)
    #  7:              Human Rights Council (HRC)
    #  8:                                       A
    #  9:              Human Rights Council (HRC)
    # 10:                                       A
    

    This does two things: (i) the use of .SDcols for iterating over a dynamic set of columns is preferred and faster, and allows programmatic determination of those columns (what you need); (ii) using lapply allows you to do this to one or more columns. If you know you'll always do just one column, this still works well with very little overhead.

  2. You can get/mget the data. This is the way to tell something to grab the contents of a variable identified in a string vector.

    If you know that you will always have exactly one column, then you can use get:

    preferences[, (cols) := gsub(get(cols), pattern = "UN1", replacement = "A")]
    

    If there is even a chance that you'll have more than one, I strongly recommend mget. (Even if you think you'll always have one, this is still safe.)

    preferences[, (cols) := lapply(mget(cols), gsub, pattern = "UN1", replacement = "A")]
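
Since the goal in the question is to run this over several columns, here is a small sketch of how that might look. The toy tables dt and dt2 and the second column Pref_2 are assumptions made up for illustration; they are not part of the original data. The loop with set() is an alternative for readers who prefer an explicit for loop, not a replacement for the .SDcols approach.

```r
library(data.table)

# Toy data; the column "Pref_2" is hypothetical, added only for illustration.
make_toy <- function() {
  data.table(
    Pref_1 = c("UN1", "Human Rights Council (HRC)", "UN1"),
    Pref_2 = c("Commission on Narcotic Drugs (CND)", "UN1", "UN1")
  )
}
cols <- c("Pref_1", "Pref_2")

# Option A: one call handles every column named in cols.
dt <- make_toy()
dt[, (cols) := lapply(.SD, gsub, pattern = "UN1", replacement = "A"), .SDcols = cols]

# Option B: an explicit loop; set() modifies dt2 by reference, one column at a time.
dt2 <- make_toy()
for (col in cols) {
  set(dt2, j = col, value = gsub("UN1", "A", dt2[[col]]))
}
```

Adding another column name to cols is then the only change needed to extend either version.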
    

Data

library(data.table)
preferences <- setDT(structure(list(Pref_1 = c("UN1", "Food and Agriculture Organization (FAO)", "United Nations Educational, Scientific and Cultural Organization (UNESCO)", "United Nations Development Programme (UNDP)", "Commission on Narcotic Drugs (CND)", "Commission on Narcotic Drugs (CND)", "Human Rights Council (HRC)", "UN1", "Human Rights Council (HRC)", "UN1")), class = c("data.table", "data.frame"), row.names = c(NA, -10L)))
cols <- "Pref_1"
r2evans
  • Thank you so much @r2evans, this code worked! Happy to be learning through trial and error one step at a time – AaronSzcz Feb 02 '22 at 15:18