How to isolate a word next to a specified word

Question

My dataframe has a variety of strings. See sample df:

strings <- c("Average complications and higher payment",
        "Average complications and average payment",
        "Average complications and lower payment",
        "Average mortality and higher payment",
        "Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)

I'd like to isolate the first word in the sentence and the second-to-last. The second-to-last will always precede the word "payment."

Here's what my desired df would look like:

strings <- c("Average complications and higher payment",
        "Average complications and average payment",
        "Average complications and lower payment",
        "Average mortality and higher payment",
        "Better mortality and average payment")
QualityWord <- c("Average","Average","Average","Average","Better")
PaymentWord <- c("Higher","Average","Lower","Higher","Average")
desireddf <- data.frame(strings, QualityWord, PaymentWord, stringsAsFactors = FALSE)

The resulting strings don't need to be case sensitive.

I'm able to write code to get the first word in a sentence (split at the space) but can't figure out how to pull a word to the left (or right, for that matter) of a reference word, which is "payment" in this case.

Jonathan Carroll · Accepted Answer · 2017-08-17T04:48:47.390

df$QualityWord = sub("(\\w+).*?$", "\\1", df$strings)
df$PaymentWord = sub(".*?(\\w+) payment$", "\\1", df$strings)

df
#>                                     strings QualityWord PaymentWord
#> 1  Average complications and higher payment     Average      higher
#> 2 Average complications and average payment     Average     average
#> 3   Average complications and lower payment     Average       lower
#> 4      Average mortality and higher payment     Average      higher
#> 5      Better mortality and average payment      Better     average
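
The output above is lower-case, while the desired data frame in the question shows capitalised values ("Higher", "Lower"). The question says case does not matter, so this step is optional. If an exact match with the desired output is wanted, a minimal sketch might look like this; `capitalize_first` is a hypothetical helper name, not part of the answer.

```r
# Hypothetical helper (an assumption, not from the answer):
# upper-case the first character of each string and keep the rest as-is.
capitalize_first <- function(x) {
  paste0(toupper(substring(x, 1, 1)), substring(x, 2))
}

payment_word <- c("higher", "average", "lower")
capitalized <- capitalize_first(payment_word)
print(capitalized)
```

It could then be applied as `df$PaymentWord <- capitalize_first(df$PaymentWord)`.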

The regex terms explained:

(\\w+) = match one or more word characters, captured as a group
.*? = match any characters, non-greedily
" payment" = match a space, then the literal characters payment
$ = match the end of the string
\\1 = replace the match with the contents of the first captured group; since each pattern matches the whole string, only the captured word remains
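
The question also asks about the word to the right of a reference word. The same capture-group idea could be wrapped into two small helpers, sketched here with base R's `regexec()` and `regmatches()`. The names `word_before` and `word_after` are hypothetical, and the sketch assumes `ref` contains no regex metacharacters, since it is pasted into the pattern as-is. Unlike `sub()`, which returns the input unchanged when nothing matches, these return `NA` for non-matching strings.

```r
# Hypothetical helpers (assumptions, not from the answer).
# Pull the first capture group out of each match, or NA when there is none.
first_group <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x, ignore.case = TRUE, perl = TRUE))
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

# Word immediately to the left of `ref`.
word_before <- function(x, ref) {
  first_group(x, paste0("(\\w+)\\s+", ref, "\\b"))
}

# Word immediately to the right of `ref`.
word_after <- function(x, ref) {
  first_group(x, paste0("\\b", ref, "\\s+(\\w+)"))
}

strings <- c("Average complications and higher payment",
             "Better mortality and average payment")

before_payment <- word_before(strings, "payment")
after_complications <- word_after(strings, "complications")
```

Here `before_payment` holds the word preceding "payment" in each string, and `after_complications` is `NA` for the second string because it does not contain "complications".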

score 1 · Answer 2 · answered Aug 17 '17 at 05:52

We can use extract from tidyr

library(tidyverse)
df %>%
   extract(strings, into = c("QualityWord", "PaymentWord"),
           "^(\\w+).*\\b(\\w+)\\s+\\w+$", remove = FALSE)
#                                   strings QualityWord PaymentWord
#1  Average complications and higher payment     Average      higher
#2 Average complications and average payment     Average     average
#3   Average complications and lower payment     Average       lower
#4      Average mortality and higher payment     Average      higher
#5      Better mortality and average payment      Better     average
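
If adding a tidyverse dependency is not an option, a similar two-group extraction could be sketched in base R with `utils::strcapture()` (available from R 3.4.0). This is an alternative to the answer above, not what it uses; the pattern is the same one, and `proto` only declares the names and types of the output columns.

```r
strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)

# Empty prototype: one character column per capture group in the pattern.
proto <- data.frame(QualityWord = character(), PaymentWord = character(),
                    stringsAsFactors = FALSE)

# Each capture group fills the corresponding prototype column;
# strings that do not match the pattern yield NA in both columns.
words <- utils::strcapture("^(\\w+).*\\b(\\w+)\\s+\\w+$", df$strings, proto,
                           perl = TRUE)

result <- cbind(df, words)
print(result)
```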

Silence Dogood · Answer 3 · 2017-08-17T05:01:58.073

With strsplit, head and tail functions:

outDF = do.call(rbind, lapply(df$strings, function(x) {

  # split the string into words
  strObj = unlist(strsplit(x, split = " "))

  # one-row data frame: first word and second-to-last word
  data.frame(strings = x, QualityWord = head(strObj, 1),
             PaymentWord = head(tail(strObj, 2), 1), stringsAsFactors = FALSE)

}))

outDF
#                                    strings QualityWord PaymentWord
#1  Average complications and higher payment     Average      higher
#2 Average complications and average payment     Average     average
#3   Average complications and lower payment     Average       lower
#4      Average mortality and higher payment     Average      higher
#5      Better mortality and average payment      Better     average

OR:

With dplyr and a custom function:

library(dplyr)

customFn = function(x) {
  strObj = unlist(strsplit(x, split = " "))
  data.frame(strings = x, QualityWord = head(strObj, 1),
             PaymentWord = head(tail(strObj, 2), 1), stringsAsFactors = FALSE)
}

df %>%
  dplyr::rowwise() %>%
  dplyr::do(customFn(.$strings))
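
Both variants above build one small data frame per row and bind them together afterwards. As a possible alternative, the same split-and-index idea can be sketched by splitting every string once and indexing into the pieces with `vapply()`, in base R only. The `NA` guard for strings with fewer than two words is an added assumption about how such input should be handled.

```r
strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)

# Split every string on single spaces; the result is a list of word vectors.
parts <- strsplit(df$strings, " ", fixed = TRUE)

# First word of each string.
df$QualityWord <- vapply(parts, function(p) p[1], character(1))

# Second-to-last word, or NA when a string has fewer than two words.
df$PaymentWord <- vapply(
  parts,
  function(p) if (length(p) >= 2) p[length(p) - 1] else NA_character_,
  character(1)
)

print(df)
```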
