Let's say I have a long character string: pneumonoultramicroscopicsilicovolcanoconiosis. I'd like to use stringr::str_replace_all to replace certain letters with others. According to the documentation, str_replace_all can take a named vector and replaces each name with its value. That works fine for one replacement, but with several it seems to apply them iteratively, so each replacement acts on the output of the previous one. I'm not sure this is the intended behaviour.

library(tidyverse)
text_string = "developer"
text_string %>% 
  str_replace_all(c(e = "X")) # this works fine
#> [1] "dXvXlopXr"
text_string %>% 
  str_replace_all(c(e = "p", p = "e")) # not intended behaviour
#> [1] "develoeer"

Desired result:

[1] "dpvploepr"

Which I get by introducing a new character:

text_string %>% 
  str_replace_all(c(e ="X", p = "e", X = "p"))

It's a usable workaround but hardly generalisable. Is this a bug or are my expectations wrong?

I'd like to also be able to replace n letters with n other letters simultaneously, preferably using either two vectors (like "old" and "new") or a named vector as input.

reprex edited for easier human reading

biomiha
  • Can you do the example on a smaller word? ... This one is confusing :) – Sotos Jan 09 '18 at 13:27
  • @sotos. Done. I initially wanted a very long word but I realise it's very difficult to read. – biomiha Jan 09 '18 at 13:45
  • Ok, then `chartr` will do just fine. Try `chartr('pe', 'ep', text_string)` – Sotos Jan 09 '18 at 13:55
  • Thanks @sotos but I really need to be able to replace 3 or 4 or n letters simultaneously, i.e. a generalisable use case. I only changed the reprex for easier human reading. – biomiha Jan 09 '18 at 14:09
  • You can do multiple letters too. Try `chartr('pedl', 'epf5', x)` – Sotos Jan 09 '18 at 14:10
  • The issue is not multiple letters in one pattern; the issue is multiple letters being replaced in the same string. Very different behaviour between the two. – biomiha Jan 09 '18 at 14:13
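
For reference, here is a minimal sketch of the `chartr` suggestion from the comments above. `chartr` translates characters in a single pass, so a swap such as e/p does not undo itself. The vectors `old` and `new` are hypothetical names used only for illustration.

```r
text_string <- "developer"

# Two-letter swap: e -> p and p -> e are applied in one pass
swap_two <- chartr("ep", "pe", text_string)
swap_two
#> [1] "dpvploepr"

# n letters supplied as two vectors (hypothetical `old` and `new`)
old <- c("p", "e", "d", "l")
new <- c("e", "p", "f", "5")
swap_n <- chartr(paste(old, collapse = ""), paste(new, collapse = ""), text_string)
swap_n
#> [1] "fpvp5oepr"
```

Note that this only covers one-character-to-one-character replacements; it will not help with longer patterns or regular expressions.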

4 Answers

2023 Update

Back when I first answered this, I had a thrown-together R package that was only on my GitHub. Since then, I've refined it substantially; it's now on CRAN and even used in other packages.

The readme and CRAN documentation spell all this out, but I understand how helpful code is on this page. The updated usage is based on passing in vectors of patterns and replacements. There's a recycle option that allows you to supply a replacement list that's shorter than the pattern list and just keep cycling through it. You can also pass arguments to regexpr in the backend (e.g. fixed=TRUE).

install.packages('mgsub')
library(mgsub)
mgsub("developer", 
      pattern = c("e", "p"), 
      replacement = c("p", "e"))
#> [1] "dpvploepr"

Original Answer

I'm working on a package to deal with this type of problem. It is safer than the qdap::mgsub function because it does not rely on placeholders. It fully supports regex in both the pattern and the replacement. You provide a named list where the names are the strings to match and the values are their replacements.

devtools::install_github("bmewing/mgsub")
library(mgsub)
mgsub("developer",list("e" ="p", "p" = "e"))
#> [1] "dpvploepr"

qdap::mgsub(c("e","p"),c("p","e"),"developer")
#> [1] "dpvploppr"
Mark

My workaround would be to take advantage of the fact that str_replace_all can take functions as an input for the replacement.

library(stringr)
text_string = "developer"
pattern <- "p|e"
fun <- function(query) {
    if(query == "e") y <- "p"
    if(query == "p") y <- "e"
    return(y)
}

str_replace_all(text_string, pattern, fun)

Of course, if you need to scale up, I would suggest using a more sophisticated function.
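
One way such a function might look is a lookup in a named vector instead of a chain of if statements. This is only a sketch: `rules` and `swap` are hypothetical names, and the pattern is built on the assumption that every name is a plain single character with no regex metacharacters. The lookup is vectorised, so it should behave the same whether stringr hands the function one match at a time or all matches at once (which of these happens may depend on the stringr version).

```r
library(stringr)

# Lookup table: names are the characters to match, values are the replacements
rules <- c(e = "p", p = "e", d = "D")

# Alternation pattern built from the names, here "e|p|d"
pattern <- paste(names(rules), collapse = "|")

# Vectorised lookup; unname() drops the names so a plain character vector is returned
swap <- function(match) unname(rules[match])

result <- str_replace_all("developer", pattern, swap)
result
#> [1] "Dpvploepr"
```

Because every match is looked up in the original string rather than in a partially replaced one, the e/p swap no longer cancels itself out.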

Benjamin Schwetz
  • Definitely the best answer so far. I'll look into a generalisable way of setting up the if statements. – biomiha Jan 09 '18 at 14:11
  • I guess it really depends on your use case and where the information comes from, but some sort of table structure with one column for the pattern and one for the replacement would probably be best. You can find an example with colors and RGB values in the documentation of str_replace_all. Maybe that will inspire you :) – Benjamin Schwetz Jan 09 '18 at 14:38
  • This is a very nice solution. Instead of the somewhat clunky series of if statements, an option would be to use `switch`: `fun <- function(query) switch(query, e = "p", p = "e")`. There is also no need for an explicit `return()` in that case. – tjebo Feb 20 '21 at 18:20

The iterative behavior is intended. That said, we can write our own workaround. I am going to use character subsetting for the replacement.

In a named vector, we can look up things by name and get a replacement value for each name. This is like doing all the replacement simultaneously.

rules <- c(a = "X", b = "Y", X = "a")
chars <- c("a", "a", "b", "X", "X")
rules[chars]
#>   a   a   b   X   X 
#> "X" "X" "Y" "a" "a"

So here, looking up "a" in the rules vector gets us "X", effectively replacing "a" with "X". The same goes for the other characters.

One problem is that names without a match yield NA.

rules <- c(a = "X", b = "Y", X = "a")
chars <- c("a", "Y", "Z")
rules[chars]
#>    a <NA> <NA> 
#>  "X"   NA   NA

To prevent the NAs from appearing, we can expand the rules to include any new characters so that a character is replaced by itself.

rules <- c(a = "X", b = "Y", X = "a")
chars <- c("a", "Y", "Z")
no_rule <- chars[! chars %in% names(rules)]
rules2 <- c(rules, setNames(no_rule, no_rule))
rules2[chars]
#>   a   Y   Z 
#> "X" "Y" "Z"

And that's the logic behind the following function.

  • Break strings to characters
  • Create a full list of replacement rules
  • Look up replacement values
  • Glue strings back together

library(stringr)

str_replace_chars <- function(string, rules) {
  # Expand rules to replace characters with themselves 
  # if those characters do not have a replacement rule
  chars <- unique(unlist(strsplit(string, "")))
  complete_rules <- setNames(chars, chars)
  complete_rules[names(rules)] <- rules

  # Split each string into characters, replace and unsplit
  for (string_i in seq_along(string)) {
    chars_i <- unlist(strsplit(string[string_i], ""))
    string[string_i] <- paste0(complete_rules[chars_i], collapse = "")
  }
  string
}

rules <- c(a = "X", p = "e", e = "p")
string <- c("application", "developer")
str_replace_chars(string, rules)
#> [1] "XeelicXtion" "dpvploepr"
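
Since the question asks for two vectors ("old" and "new") as input, here is a compact base R sketch of the same split, look up, and glue idea. The function name `replace_chars` and its arguments are hypothetical; it assumes `old` holds single characters and that `old` and `new` have the same length.

```r
replace_chars <- function(string, old, new) {
  stopifnot(length(old) == length(new))
  # Named lookup vector: names are the characters to replace
  rules <- setNames(new, old)
  # Split each string into characters, swap the ones that have a rule, re-join
  vapply(strsplit(string, ""), function(chars) {
    hit <- chars %in% old
    chars[hit] <- rules[chars[hit]]
    paste0(chars, collapse = "")
  }, character(1))
}

out <- replace_chars(c("application", "developer"),
                     old = c("a", "p", "e"),
                     new = c("X", "e", "p"))
out
#> [1] "XeelicXtion" "dpvploepr"
```

Characters without a rule are left untouched, so there is no need to extend the rules with identity mappings in this variant.
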
TJ Mahr
  • Can you elaborate on why the behaviour would be intended? I understand the "break/replace/glue" approach and I've managed the same thing by using recode, so that's not an issue, however I'd like to use `stringr::str_replace_all` for the integration with other tidyverse packages. – biomiha Jan 09 '18 at 15:27
  • I inferred this from the source code, which calls `stringi::stri_replace_all()` with `vectorize_all` set to false. That means the rules are applied iteratively. If `vectorize_all` were true, your example would have applied the rules separately but never merged them (`c("dpvploppr", "develoeer")`). That seems like a bad default setting because it returns more strings than the input. Also, the rules can be regular expressions, which can conflict, in which case there is no obvious way to merge the replacements together. – TJ Mahr Jan 09 '18 at 16:21
  • Great explanation. Thanks. – biomiha Jan 09 '18 at 16:38
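
The difference described in the comment above can be reproduced directly with stringi. This is a small sketch using `stri_replace_all_fixed`; it is not necessarily the exact call stringr makes internally.

```r
library(stringi)

x <- "developer"

# vectorize_all = FALSE: each rule is applied in turn to the result of the
# previous one, so e -> p runs first and p -> e then turns every p into e
sequential <- stri_replace_all_fixed(x, c("e", "p"), c("p", "e"),
                                     vectorize_all = FALSE)
sequential
#> [1] "develoeer"

# vectorize_all = TRUE (the stringi default): the input is recycled and each
# rule is applied to its own copy, giving one output string per rule
separate <- stri_replace_all_fixed(x, c("e", "p"), c("p", "e"),
                                   vectorize_all = TRUE)
separate
#> [1] "dpvploppr" "develoeer"
```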

The function probably applies the replacements in order, so after replacing every c with s and then every s with c, only c remains. Try this:

long_string %>% str_replace_all(c(c ="X", s = "U"))  %>% str_replace_all(c(X ="s", U = "c"))
  • Thank you, but I feel this is not a generalisable workaround because you would need n new characters for each n you wanted to replace. Not only is that sometimes difficult, it also makes the code unwieldy. Please note my edit as per Sotos' comment. – biomiha Jan 09 '18 at 13:49
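
For completeness, here is a sketch of the two-step placeholder idea applied to the "developer" example from the question. It is written with base R `gsub` rather than stringr so that it has no dependencies. The placeholders X and U are assumptions: they must not occur anywhere in the input, which is the limitation raised in the comment above.

```r
text_string <- "developer"

# Step 1: move both letters to placeholders (e -> X, p -> U)
step1 <- gsub("p", "U", gsub("e", "X", text_string, fixed = TRUE), fixed = TRUE)
step1
#> [1] "dXvXloUXr"

# Step 2: map the placeholders to the final letters (X -> p, U -> e)
step2 <- gsub("U", "e", gsub("X", "p", step1, fixed = TRUE), fixed = TRUE)
step2
#> [1] "dpvploepr"
```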