
I am cleaning some string data using some stringi functions as part of a pipe.

I would like these functions to be recursive, so that they handle every occurrence of a regex match, not only the first one. I cannot predict in advance how many times I would need to run the function to clean the data properly.

library(stringi)

test_1 <- "AAA A B BBB"
str_squish(str_remove(test_1, "\\b[A-Z]\\b"))
result <- "AAA B BBB"
desired <- "AAA BBB"

test_2 <- "AAA AA BBB BB CCCC"
str_replace(test_2, "(?<=\\s[A-Z]{2,3})\\s", "")
result <- "AAA AABBB BB CCCC"
desired <- "AAA AABBB BBCCCC"
MCS
    For starters, try `str_remove_all`, and I think you mean `library(stringr)` and not `library(stringi)` – MrFlick Jul 07 '21 at 07:48
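The suggestion in the comment above can be sketched as follows, assuming the stringr package is what was intended. `str_remove_all` removes every match rather than only the first, and `str_squish` then collapses the leftover whitespace.

```r
library(stringr)

test_1 <- "AAA A B BBB"

# Remove every standalone capital letter, then collapse repeated spaces.
cleaned <- str_squish(str_remove_all(test_1, "\\b[A-Z]\\b"))
cleaned
#[1] "AAA BBB"
```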

2 Answers


I would suggest using base R's gsub here, which does a global regex replacement:

test_1 <- "AAA A B BBB"
result <- gsub("[ ]{2,}", " ", gsub("[ ]*\\b[A-Z]\\b[ ]*", " ", test_1))
result

[1] "AAA BBB"
Tim Biegeleisen

You could use gsub, which replaces all matches:

test_1 <- "AAA A B BBB"
gsub(" +", " ", gsub("\\b[A-Z]\\b", "", test_1))
#[1] "AAA BBB"

test_2 <- "AAA AA BBB BB CCCC"
gsub("(?<=\\s[A-Z]{2})\\s", "", test_2, perl=TRUE)
#[1] "AAA AABBB BBCCCC"

For the regex (?<=\\s[A-Z]{2,3})\\s, it's not clear when the 2-3 letter condition should apply and where matching should start. For example, stringr::str_replace_all would give:

stringr::str_replace_all(test_2,"(?<=\\s[A-Z]{2,3})\\s","")
#[1] "AAA AABBBBBCCCC"
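One way to see why this differs from the desired output is sketched below with `stringr::str_locate_all` (a function not used in the original post). The `_all` variants scan the original string in a single pass, so the lookbehind is always evaluated against the unmodified text.

```r
library(stringr)

test_2 <- "AAA AA BBB BB CCCC"
pattern <- "(?<=\\s[A-Z]{2,3})\\s"

# Start positions of every space the pattern matches in the original string.
hits <- str_locate_all(test_2, pattern)[[1]]
hits[, "start"]
```

The spaces after "AA", "BBB" and "BB" all match at once. Replacing one match at a time instead changes what precedes the later spaces, which is why the step-by-step result keeps the space after "BBB".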

You can also use a recursive function call:

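# Remove one matching space per call and recurse until the string stops changing.
# Assumes x is a single string (length-1 character vector).
f <- function(x) {
  y <- stringr::str_replace(x, "(?<=\\s[A-Z]{2,3})\\s", "")
  if (x == y) x
  else f(y)
}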
f(test_2)
#[1] "AAA AABBB BBCCCC"
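The recursive function above compares `x == y` inside `if`, so it expects a single string. A loop-based sketch that also accepts character vectors is shown below. `replace_until_stable` is a hypothetical helper name, and the pattern is rewritten as two fixed-length lookbehinds on the assumption that base R's `perl = TRUE` engine may reject a variable-length one.

```r
# Hypothetical helper: apply a first-match replacement repeatedly
# until no element of x changes any more.
replace_until_stable <- function(x, pattern, replacement = "") {
  repeat {
    y <- sub(pattern, replacement, x, perl = TRUE)
    if (identical(x, y)) break
    x <- y
  }
  x
}

# Same condition as (?<=\\s[A-Z]{2,3})\\s, split into fixed-length lookbehinds.
pattern <- "(?<=\\s[A-Z]{2})\\s|(?<=\\s[A-Z]{3})\\s"

res <- replace_until_stable(c("AAA AA BBB BB CCCC", "AAA A B BBB"), pattern)
res
#[1] "AAA AABBB BBCCCC" "AAA A B BBB"
```

With the default empty replacement the loop terminates, because every successful replacement shortens the string.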
GKi
  • Thanks a lot. For the second function I split that into two, running first the 2-letter condition only and then the 3-letter condition only – MCS Jul 07 '21 at 08:31
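One reading of the comment above, sketched in base R under the assumption that each condition is applied globally and in that order (the commenter's actual code is not shown):

```r
test_2 <- "AAA AA BBB BB CCCC"

# First pass: drop the space after every whitespace-preceded 2-letter word.
step_1 <- gsub("(?<=\\s[A-Z]{2})\\s", "", test_2, perl = TRUE)
# Second pass: drop the space after any 3-letter word still preceded by whitespace.
step_2 <- gsub("(?<=\\s[A-Z]{3})\\s", "", step_1, perl = TRUE)
step_2
#[1] "AAA AABBB BBCCCC"
```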