1

Here is a sample dataframe:

a <- c("cat", "dog", "mouse")
b <- c("my cat is a tabby cat and is a friendly cat", "walk the dog", "the mouse is scared of the other mouse")
df <- data.frame(a,b)

For each row, I'd like to remove the second occurrence of the value in col a from the text in col b.

Here is my desired output:

a      b
cat    my cat is a tabby and is a friendly cat
dog    walk the dog
mouse  the mouse is scared of the other

I've tried different combinations of gsub and some stringr functions, but I haven't even gotten close to being able to remove the second (and only the second) occurrence of the string in col a in col b. I think I'm asking something similar to this one, but I'm not familiar with Perl and couldn't translate it to R.

Thanks!

carozimm

4 Answers

1

It takes a little work to build the right regex.

P1 = paste(a, collapse="|")
PAT = paste0("((", P1, ").*?)(\\2)")

sub(PAT, "\\1", b, perl=TRUE)
[1] "my cat is a tabby  and is a friendly cat"
[2] "walk the dog"                            
[3] "the mouse is scared of the other "   
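The pattern above leaves a stray space behind, and it would also match the word inside a longer one (for example "cat" in "category"). One possible variant, sketched here on the assumption that the values in a are plain words with no regex metacharacters, adds `\\b` word boundaries and lets `\\s*` swallow the whitespace in front of the second occurrence. The names `PAT2` and `cleaned` are made up for this illustration.

```r
a <- c("cat", "dog", "mouse")
b <- c("my cat is a tabby cat and is a friendly cat",
       "walk the dog",
       "the mouse is scared of the other mouse")

# Assumption: every value in `a` is a plain word (no regex metacharacters).
P1 <- paste(a, collapse = "|")

# Group 1: the first occurrence plus the text up to the repeat (this part is kept).
# Group 2: the word itself, reused through the backreference \\2.
# \\s* consumes the whitespace directly before the second occurrence.
PAT2 <- paste0("(\\b(", P1, ")\\b.*?)\\s*\\b\\2\\b")

cleaned <- sub(PAT2, "\\1", b, perl = TRUE)
print(cleaned)
```

On the sample data this should reproduce the desired output from the question without the extra spaces. Like the original pattern, it still checks every row against all words in a, not only the word in the same row.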
G5W
1

I've actually found another solution that, though longer, may be clearer for other regex beginners:

library(stringr)
# Replace first instance of col a in col b with "INTERIM" 
df$b <- str_replace(b, a, "INTERIM")

# Now that the original first instance of col a is re-labeled to "INTERIM", I can again replace the first instance of col a in col b, this time with an empty string
df$b <- str_replace(df$b, a, "")

# And I can re-replace the re-labeled "INTERIM" to the original string in col a
df$b <- str_replace(df$b, "INTERIM", a)

# Trim leading/trailing whitespace and collapse the "double" space left by the removal
df$b <- gsub("\\s+", " ", str_trim(df$b))


df
a            b
cat          my cat is a tabby and is a friendly cat
dog          walk the dog
mouse        the mouse is scared of the other
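The placeholder trick above only works as long as the text never contains "INTERIM" itself. A different technique that avoids the placeholder is to look up the match positions with base R's `gregexpr()` and cut the second match out by position. This is a minimal sketch, not the approach from the answer; `remove_nth` is a hypothetical helper name, and it treats the word as a fixed string, not a regex.

```r
df <- data.frame(
  a = c("cat", "dog", "mouse"),
  b = c("my cat is a tabby cat and is a friendly cat",
        "walk the dog",
        "the mouse is scared of the other mouse"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop the n-th literal occurrence of `word` from `text`.
remove_nth <- function(text, word, n = 2) {
  hits <- gregexpr(word, text, fixed = TRUE)[[1]]
  # gregexpr() returns -1 when there is no match at all
  if (hits[1] == -1 || length(hits) < n) {
    return(text)
  }
  start <- hits[n]
  end <- start + attr(hits, "match.length")[n] - 1
  out <- paste0(substr(text, 1, start - 1), substr(text, end + 1, nchar(text)))
  # Collapse the gap left by the removed word and trim the ends
  trimws(gsub("\\s+", " ", out))
}

# Apply row by row: each b is only compared with the a in the same row
df$b <- mapply(remove_nth, df$b, df$a, USE.NAMES = FALSE)
print(df)
```

Rows with fewer than two occurrences are returned unchanged, so "walk the dog" passes through as is.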
carozimm
0

You could do this...

library(stringr)
df$b <- str_replace(df$b, 
                    paste0("(.*?",df$a,".*?) ",df$a), 
                    "\\1")

df
      a                                       b
1   cat my cat is a tabby and is a friendly cat
2   dog                            walk the dog
3 mouse        the mouse is scared of the other

The regex finds the first string of characters with df$a somewhere in it, followed by a space and another df$a. The capture group is the text up to the space before the second occurrence (indicated by the (...)), and the whole text (including the second occurrence) is replaced by the capture group \\1 (which has the effect of deleting the second df$a and its preceding space). Anything after the second df$a is not affected.

Andrew Gustar
  • @carozimm Note that my solution and G5W's solution do different things. Mine compares each `df$b` only to the `df$a` in the same row, whereas the other answer compares `df$b` to ALL words in the `df$a` column (so it will delete the second "dog" in "the dog saw a cat and another dog" even if that row's `df$a` is "cat"). My solution also avoids leaving an extra space where the deleted word was. Hopefully this is the behaviour you wanted! – Andrew Gustar May 14 '18 at 16:05
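A small check of the row-wise versus column-wide difference, written in base R so it runs without stringr. The input sentence and the names `res_all` and `res_row` are invented for this sketch; the two patterns are the ones from the answers, with `mapply()` standing in for stringr's vectorisation over patterns.

```r
a <- c("cat", "dog")
b <- c("the dog saw a cat and another dog", "walk the dog")

# Column-wide: one alternation built from ALL values in a.
# The backreference \\2 means the repeated word must equal the first match.
pat_all <- paste0("((", paste(a, collapse = "|"), ").*?)(\\2)")
res_all <- sub(pat_all, "\\1", b, perl = TRUE)

# Row-wise: each text is only tested against the word in the same row.
res_row <- mapply(
  function(word, text) sub(paste0("(.*?", word, ".*?) ", word), "\\1", text, perl = TRUE),
  a, b,
  USE.NAMES = FALSE
)

print(res_all)
print(res_row)
```

In the first row the column-wide pattern should strip the second "dog" even though that row's word is "cat", while the row-wise pattern should leave the sentence alone because "cat" occurs only once.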
0

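Base R, split-apply-combine solution. Note that it keeps only the first occurrence of every word in b, so other repeated words (such as "is" or "the") are dropped as well, not just the second occurrence of a: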

# Split-apply-combine:
data.frame(
  do.call("rbind", lapply(split(df, df$a), function(x) {
    # Split b into words, keep each distinct word once, and re-join
    b <- paste(unique(unlist(strsplit(x$b, "\\s+"))), collapse = " ")
    data.frame(a = x$a, b = b)
  })),
  stringsAsFactors = FALSE, row.names = NULL
)

Data:

df <- data.frame(a = c("cat", "dog", "mouse"),
                 b = c("my cat is a tabby cat and is a friendly cat", "walk the dog", "the mouse is scared of the other mouse"), 
                 stringsAsFactors = FALSE)
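If only the second occurrence of the row's own word should go, a token-based sketch in base R could look like the following. `drop_second_token` is a hypothetical helper, and it assumes words are separated by whitespace with no punctuation attached (so "cat," would not count as "cat").

```r
df <- data.frame(a = c("cat", "dog", "mouse"),
                 b = c("my cat is a tabby cat and is a friendly cat",
                       "walk the dog",
                       "the mouse is scared of the other mouse"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: split into words, drop the second exact match, re-join.
drop_second_token <- function(text, word) {
  tokens <- strsplit(text, "\\s+")[[1]]
  hits <- which(tokens == word)
  if (length(hits) >= 2) {
    tokens <- tokens[-hits[2]]
  }
  paste(tokens, collapse = " ")
}

# Row-wise: pair each b with the a from the same row
df$b <- mapply(drop_second_token, df$b, df$a, USE.NAMES = FALSE)
print(df)
```

Because the text is re-joined with single spaces, no extra whitespace clean-up step is needed afterwards.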
hello_friend