0

Consider the following simplified data frame:

df<-data.frame(x1=c("A","B","C"),x2=c("K to B","K to B","K to B"))

I want to replace strings in x2 with NA (or "") in the rows where the x1 character cannot be found as part of x2. That is, the data frame should be corrected to:

df_corrected<-data.frame(x1=c("A","B","C"),x2=c("NA","K to B","NA"))

The actual dataset contains 95000 rows and many different expressions in x2. I've otherwise used Tidyverse to clean the data. I have tried using grepl() for searching for the x1 value in the x2-string, however I am having an issue doing this iteratively for each row (need function/forward loop?) and combining it with mutate(). I am also open to other options if better (e.g. sapply and base R? Or sqldf?)

Thanks a lot in advance!

LMc
  • 12,577
  • 3
  • 31
  • 43
Bettina
  • 35
  • 6

3 Answers3

3

Another way, which doesn't use rowwise():

library(dplyr)

df |> mutate(x2 = if_else(str_detect(x2, x1), x2, NA))
Mark
  • 7,785
  • 2
  • 14
  • 34
2

You could use grepl to check if x2 contains the string of x1 rowwise like this:

library(dplyr)

df %>%
  rowwise() %>%
  mutate(x2 = case_when(grepl(x1, x2) ~ x2,
                        TRUE ~ "NA")) %>%
  ungroup()
#> # A tibble: 3 × 2
#>   x1    x2    
#>   <chr> <chr> 
#> 1 A     NA    
#> 2 B     K to B
#> 3 C     NA

Created on 2023-08-30 with reprex v2.0.2

Quinten
  • 35,235
  • 5
  • 20
  • 53
  • 2
    I would recommend piping this output to `ungroup()` -- leaving it as a row-wise tibble often has unintended consequences for users. – LMc Aug 30 '23 at 15:09
  • 3
    @LMc, Fully agree with you, thanks for your comment! – Quinten Aug 30 '23 at 15:13
  • 1
    Ah of course; rowwise() instead of loops, when in Tidyverse! Thanks! I'll be using the other suggested solution with str_detect, but thank you for this suggestion with grepl() :) – Bettina Aug 31 '23 at 08:57
2
library(dplyr)
library(stringr)

df |>
  mutate(x2 = str_replace_all(x2, str_c(".*\\b", x1, "\\b.*"), "NA"))

I think the use of word boundaries (\\b) is necessary here so "B" is not matched in "A to Bee".

Here is a base R solution:

df$x2[mapply(grepl, df$x1, df$x2)] <- "NA"

# with word boundaries
df$x2[mapply(grepl, paste0(".*\\b", df$x1, "\\b.*"), df$x2)] <- "NA"

Note: you requested those elements be changed to "NA" which is different than NA (notice the lack of quotes). The former is just a character string the latter means something special in R. There is a lot more functionality to deal with NA (e.g. is.na()). I would recommend using NA not "NA".

LMc
  • 12,577
  • 3
  • 31
  • 43
  • 1
    Thank you very much for this very complete answer. In the actual data, the characters in x1 actually all consist of a letter and a number, so it isn't an issue. But very good to know how to do this for another time. Thanks! Regarding NA: I am still a bit new in R vocabulary, so thanks for specifying! I'll keep it in mind (and use NA) – Bettina Aug 31 '23 at 08:51