How to match a word after a certain character without knowing the word in R?

Question

I would like to match the word after a - in my text. If that matched word is also the end of another word, I would like to split that other word so the matched word stands on its own.

Example of the text:

JOHN LION - XYZ RAN RUN TREEABC GRASS - ABC LIMB RAN RUN LION -XYZ JOG SUN
SKY - ABC LION JOHN PONDABC RUN - PDF STONE

What I would like the text to look like:

JOHN LION - XYZ RAN RUN TREE ABC GRASS - ABC LIMB RAN RUN LION -XYZ JOG SUN
SKY - ABC LION JOHN POND ABC RUN - PDF STONE

I do not want to do a grepl and a gsub on ABC, because the word after the dash is always changing and will appear multiple times. The word in front of the matched word will also always be different and will not always be TREE. No matter what the word in front of the matched word is, I always want to do a split.

If I do the following str_extract:

str_extract(df, "(?<=-\\s)\\w+")

Then I match XYZ, not ABC.

I only want to match the word after the - if it is also at the end of another word, but again I do not know what that other word will be.

I am stuck as to what to do. Please let me know if any further information is needed. Any help will be greatly appreciated.

I see all sorts of problems with this, such as what happens if the post dash matched word appears more than once? Are there any instances where you would _not_ want to split up a matched word? — Tim Biegeleisen, Feb 05 '16 at 04:01
@Tim thank you for your comment I have edited my question. But the post dash matched word will appear many times, and I always want to do a split no matter what the word in front of the matched word will be. — Dre, Feb 05 '16 at 04:29

score 3 · Accepted Answer · answered Feb 05 '16 at 05:18

Here's one mildly hacky way. Let's call the data s:

s <- 'JOHN LION - XYZ RAN RUN TREEABC GRASS - ABC LIMB RAN RUN LION -XYZ JOG SUN SKY - ABC LION JOHN PONDABC RUN - PDF STONE'

With stringr, let's use your existing regex to extract the patterns to be matched:

library(stringr)
pat <- str_extract_all(s, "(?<=-\\s)\\w+")

Use those patterns to find all the words with one or more letters or digits before the pattern and a whitespace character after (i.e. the words that need spaces):

words <- str_extract_all(s, paste0('[A-Za-z0-9]+', pat[[1]], '\\s'))
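
To see what these two steps produce for the sample string, here is a small inspection sketch. It restates s, pat and words so that it runs on its own; the values in the comments are what I would expect from stringr matching one pattern at a time against s.

```r
library(stringr)

s <- 'JOHN LION - XYZ RAN RUN TREEABC GRASS - ABC LIMB RAN RUN LION -XYZ JOG SUN SKY - ABC LION JOHN PONDABC RUN - PDF STONE'

# Every word that directly follows "- " (a dash plus one whitespace character)
pat <- str_extract_all(s, "(?<=-\\s)\\w+")
pat[[1]]
# "XYZ" "ABC" "ABC" "PDF"   ("-XYZ" is skipped: there is no space after the dash)

# One regex per extracted word, so words is a list with one element per pattern
words <- str_extract_all(s, paste0('[A-Za-z0-9]+', pat[[1]], '\\s'))
lengths(words)
# 0 2 2 0   (only ABC is glued onto other words: "TREEABC " and "PONDABC ")
```

Note that ABC appears twice in pat, so the same two words are found twice; the replacement step further down simply repeats harmlessly for the duplicate.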

Insert spaces in those words by replacing the patterns with a space and then the pattern. To do it all at once, you need to use lapply, as str_extract_all produces a list.

words2 <- lapply(1:length(words), function(x){           # a little hacky
  str_replace_all(words[[x]], pat[[1]][x], paste0(' ', pat[[1]][x]))
})

To replace all the matched words with the fixed ones, we need to run str_replace_all with each word and replacement, so we either need to update s while we loop with sapply:

sapply(seq_along(unlist(words)), function(x){                      # hacky
  s <<- str_replace_all(s, unlist(words)[x], unlist(words2)[x])    # hackier
})

which will produce some useless output but update s, or use a for loop, which is somewhat cleaner:

# loop over every extracted word, not over the list of patterns
for(x in seq_along(unlist(words))){
  s <- str_replace_all(s, unlist(words)[x], unlist(words2)[x])
}

Either way, we get

> s
[1] "JOHN LION - XYZ RAN RUN TREE ABC GRASS - ABC LIMB RAN RUN LION -XYZ JOG SUN SKY - ABC LION JOHN POND ABC RUN - PDF STONE"
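
For comparison, the same result can probably be reached in base R without the intermediate lists: collect the post-dash words, then for each distinct one insert a space wherever it ends a longer word. This is an alternative sketch rather than the approach above. It assumes the glued-on word is always preceded by a letter or digit, and the names post_dash, w and glued are only illustrative.

```r
s <- 'JOHN LION - XYZ RAN RUN TREEABC GRASS - ABC LIMB RAN RUN LION -XYZ JOG SUN SKY - ABC LION JOHN PONDABC RUN - PDF STONE'

# Words that directly follow "- " (lookbehind needs perl = TRUE in base R)
post_dash <- regmatches(s, gregexpr("(?<=-\\s)\\w+", s, perl = TRUE))[[1]]

for (w in unique(post_dash)) {
  # w at the end of a word and directly preceded by a letter or digit
  glued <- paste0("(?<=[A-Za-z0-9])", w, "\\b")
  s <- gsub(glued, paste0(" ", w), s, perl = TRUE)
}

s
```

Because the match ends on a word boundary (\\b) instead of a required whitespace character, this variant would also split a glued word that sits at the very end of the string.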

This worked perfectly. I decided to go with the for loop. Thanks so much. — Dre, Feb 05 '16 at 05:52
