My dataframe has a variety of strings. See sample df:

strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)

I'd like to isolate the first word in the sentence and the second-to-last. The second-to-last will always precede the word "payment."

Here's what my desired df would look like:

strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
QualityWord <- c("Average", "Average", "Average", "Average", "Better")
PaymentWord <- c("Higher", "Average", "Lower", "Higher", "Average")
desireddf <- data.frame(strings, QualityWord, PaymentWord, stringsAsFactors = FALSE)

The capitalization of the extracted words doesn't matter.

I'm able to write code to get the first word in a sentence (split at the space) but can't figure out how to pull a word to the left (or right, for that matter) of a reference word, which is "payment" in this case.

jesstme

3 Answers

df$QualityWord = sub("(\\w+).*?$", "\\1", df$strings)
df$PaymentWord = sub(".*?(\\w+) payment$", "\\1", df$strings)

df
#>                                     strings QualityWord PaymentWord
#> 1  Average complications and higher payment     Average      higher
#> 2 Average complications and average payment     Average     average
#> 3   Average complications and lower payment     Average       lower
#> 4      Average mortality and higher payment     Average      higher
#> 5      Better mortality and average payment      Better     average

The regex terms explained:

  • (\\w+) = match one or more word characters, captured as a group
  • .*? = match any characters, non-greedily (as few as possible)
  • " payment" = match a space followed by the literal characters payment
  • $ = match the end of the string
  • \\1 = replace the whole match with the contents of the first captured group
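
To see the pieces above at work, here is a minimal sketch on a single string. The variable names and the non-matching input are made up for illustration. It also shows one thing worth knowing: when the pattern does not match, sub() returns the input unchanged rather than NA.

```r
# Hypothetical single-string example (not part of the original data frame)
x <- "Better mortality and average payment"

# First word: the captured run of word characters replaces the whole match
first_word <- sub("(\\w+).*?$", "\\1", x)

# Word immediately before " payment" at the end of the string
payment_word <- sub(".*?(\\w+) payment$", "\\1", x)

# No match: sub() leaves the string as it was
unchanged <- sub(".*?(\\w+) payment$", "\\1", "no reference word here")

print(first_word)
print(payment_word)
print(unchanged)
```

To capture the word before a different reference word, swapping payment for that word in the pattern should be enough, assuming it still sits at the end of the string.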
Jonathan Carroll

We can use extract from tidyr

library(tidyverse)
df %>%
   extract(strings, into = c("QualityWord", "PaymentWord"),
           "^(\\w+).*\\b(\\w+)\\s+\\w+$", remove = FALSE)
#                                   strings QualityWord PaymentWord
#1  Average complications and higher payment     Average      higher
#2 Average complications and average payment     Average     average
#3   Average complications and lower payment     Average       lower
#4      Average mortality and higher payment     Average      higher
#5      Better mortality and average payment      Better     average
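
For comparison, the same two-group pattern can be applied in base R. This is only a sketch of one possible equivalent, not what extract does internally; the two-row sample and the names pattern, m, parts and result are illustrative.

```r
strings <- c("Average complications and higher payment",
             "Better mortality and average payment")

# Group 1: first word. Group 2: the word before the final word.
pattern <- "^(\\w+).*\\b(\\w+)\\s+\\w+$"

# regexec() locates the full match and each capture group;
# regmatches() converts those positions to text, one vector per string
m <- regmatches(strings, regexec(pattern, strings, perl = TRUE))

# Element 1 is the whole match; elements 2 and 3 are the capture groups
parts <- do.call(rbind, lapply(m, function(v) v[2:3]))

result <- data.frame(strings,
                     QualityWord = parts[, 1],
                     PaymentWord = parts[, 2],
                     stringsAsFactors = FALSE)
print(result)
```

Strings that do not match the pattern end up with NA in both new columns, because indexing the empty match vector with 2:3 yields NA.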
akrun

With strsplit, head and tail functions:

outDF = do.call(rbind, lapply(df$strings, function(x) {

  # split the sentence into words
  strObj = unlist(strsplit(x, split = " "))

  # one-row data frame: first word and second-to-last word
  data.frame(strings = x,
             QualityWord = head(strObj, 1),
             PaymentWord = head(tail(strObj, 2), 1),
             stringsAsFactors = FALSE)

}))

outDF
#                                    strings QualityWord PaymentWord
#1  Average complications and higher payment     Average      higher
#2 Average complications and average payment     Average     average
#3   Average complications and lower payment     Average       lower
#4      Average mortality and higher payment     Average      higher
#5      Better mortality and average payment      Better     average

OR:

With dplyr and a custom function:

library(dplyr)

customFn = function(x) {
  # split the sentence into words, then return a one-row data frame
  strObj = unlist(strsplit(x, split = " "))
  data.frame(strings = x,
             QualityWord = head(strObj, 1),
             PaymentWord = head(tail(strObj, 2), 1),
             stringsAsFactors = FALSE)
}

df %>%
  dplyr::rowwise() %>%
  dplyr::do(customFn(.$strings))
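
Both variants above build one small data frame per row. A lighter sketch of the same head/tail idea splits every string once and adds the two columns directly. It assumes each string contains at least one word, since vapply would stop with an error on an empty string; the name words is illustrative.

```r
strings <- c("Average complications and higher payment",
             "Average complications and average payment",
             "Average complications and lower payment",
             "Average mortality and higher payment",
             "Better mortality and average payment")
df <- data.frame(strings, stringsAsFactors = FALSE)

# Split every sentence once; the result is a list of word vectors
words <- strsplit(df$strings, split = " ", fixed = TRUE)

# First word of each sentence
df$QualityWord <- vapply(words, function(w) head(w, 1), character(1))

# Second-to-last word: take the last two words, then the first of those
df$PaymentWord <- vapply(words, function(w) head(tail(w, 2), 1), character(1))

print(df)
```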
Silence Dogood