Coding based on strings and order of strings in R

Question

I have to code many data.frames. For example:

tt <- data.frame(V1=c("test1", "test3", "test1", "test4", "wins", "loses"),
             V2=c("someannotation", "othertext", "loads of text including the word winning for the winner and the word losing for the loser", "blablabla", "blablabla", "blablabla"))

tt 
V1       V2
test1    someannotation
test3    othertext
test1    loads of text including the word winning for the winner and the word losing for the loser
test4    blablabla
wins     blablabla
loses    blablabla

The coding has to go into a new data.frame, and I have to code whether a runner wins or loses. If V1 is wins, the runner wins (and if V1 is loses, the runner loses). However, a runner may also win or lose only parts of a race; this is indicated by test1 in V1 and specified by V2. If the term winning appears before the term losing in V2, the runner wins parts of the race (and vice versa).

I've tried to adapt elements of the answers to the following question to find out which word/string appears at which position:

find location of character in string

The implementation looks like this:

result <- data.frame()
for(i in 1:length(tt[,1])){
  if(grepl("wins", tt[i,1])) result[i,1] <- "wins"
  if(grepl("loses", tt[i,1])) result[i,1] <- "loses"
  if(grepl("test1", tt[i,1])&(which(strsplit(tt[i,2], " ")[[1]]=="winning")>which(strsplit(tt[i,2], " ")[[1]]=="losing"))) result[i,1] <- "loses"
  if(grepl("test1", tt[i,1])&(which(strsplit(tt[i,2], " ")[[1]]=="winning")<which(strsplit(tt[i,2], " ")[[1]]=="losing"))) result[i,1] <- "wins"
}

But there is an error message for cells of the column V2 that don't contain either winning or losing:

Error in if (grepl("test1", tt[i, 1]) & (which(strsplit(tt[i, 2], " ")[[1]] ==  : argument is of length zero

Does someone have a workaround for that problem, or even a more sophisticated solution? Any help is appreciated, thanks!

Edit: As @grrgrrbla kindly clarified, there are two ways to win: either V1 == "wins", or V2 contains the word "winning" before the word "losing". Likewise, there are two ways to lose: either V1 == "loses", or V2 contains "losing" before "winning".

My output should look like this:

result
  V1
  NA
  NA
  wins
  NA
  wins
  loses

Please specify EXACTLY what output you want: one column or two, just one column saying win/lose, the index, etc. From what I understand there are two possibilities to win: V1 == "wins", or V2 contains the word "winning" before the word "losing". There are also two possibilities to lose: V1 == "loses", or V2 contains "losing" before "winning". Right? And the output should be one column saying "wins" or "loses", right? — grrgrrbla, Feb 05 '15 at 11:09
Why are there NA values in the output? When should an NA value appear? What input should give NA as a result? — grrgrrbla, Feb 05 '15 at 11:20
Since only row 3, 5 and 6 contain "win/winning" or "loses/losing" the coding of the other rows should result in `NA`. — Thomas, Feb 05 '15 at 11:22

Cath · Accepted Answer · 2015-02-05T13:31:04.973

You can create a function (probably not the simplest solution...) that returns "wins" if either of your "winning" conditions is satisfied, "loses" if either of your "losing" conditions is satisfied, and NA in all other cases:

wilo <- function(vec){
    if(grepl("wins|loses", vec[1])){ # if the first variable is "wins" or "loses", return its value
        return(vec[1])
    }
    pos_win  <- gregexpr("winning", vec[2])[[1]][1] # position of the first "winning", -1 if absent
    pos_lose <- gregexpr("losing", vec[2])[[1]][1]  # position of the first "losing", -1 if absent
    if(pos_win > 0 && pos_lose > 0){ # both words must be present to compare their order
        if(pos_win < pos_lose) "wins" else "loses" # "winning" placed before "losing" means "wins"
    } else {
        NA # if none of the conditions are fulfilled, return NA
    }
}

Then you can apply the function to each row of your data.frame:

apply(tt,1,wilo)
#[1] NA      NA      "wins"  NA      "wins"  "loses"
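For context, the loop in the question most likely fails because which() returns integer(0) when a word is absent, so the comparison has length zero and if() rejects it. regexpr() avoids this by always returning one position per string (-1 when there is no match). Below is a minimal vectorized sketch of the same coding in base R. The helper name code_result is hypothetical, and it assumes that a partial result requires "test1" in V1 and both words in V2:

```r
tt <- data.frame(
  V1 = c("test1", "test3", "test1", "test4", "wins", "loses"),
  V2 = c("someannotation", "othertext",
         "loads of text including the word winning for the winner and the word losing for the loser",
         "blablabla", "blablabla", "blablabla"),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not part of the original answer.
code_result <- function(v1, v2) {
  # regexpr() is vectorized and returns -1 where the pattern is absent
  pos_win  <- as.integer(regexpr("winning", v2, fixed = TRUE))
  pos_lose <- as.integer(regexpr("losing",  v2, fixed = TRUE))

  out <- rep(NA_character_, length(v1))

  # Assumption: a partial result needs "test1" in V1 and both words in V2
  partial <- v1 == "test1" & pos_win > 0 & pos_lose > 0
  out[partial] <- ifelse(pos_win[partial] < pos_lose[partial], "wins", "loses")

  # Explicit results given directly in V1
  explicit <- v1 %in% c("wins", "loses")
  out[explicit] <- v1[explicit]
  out
}

result <- data.frame(V1 = code_result(tt$V1, tt$V2), stringsAsFactors = FALSE)
result$V1
```

Unlike wilo, this sketch also checks V1 for "test1" before looking at V2, which follows the description in the question; drop that condition if V2 alone should decide.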

NB: As suggested by @grrgrrbla, an alternative to gregexpr is the function str_locate from the stringr package.
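A rough sketch of that alternative, assuming the stringr package is installed. The helper name order_code is hypothetical and only covers the comparison of the two word positions in V2, not the check on V1:

```r
library(stringr)  # assumption: the stringr package is installed

# Hypothetical helper, not part of the original answer.
order_code <- function(text) {
  # str_locate() returns a matrix with "start" and "end" columns,
  # holding NA where the pattern does not occur
  win  <- str_locate(text, "winning")[, "start"]
  lose <- str_locate(text, "losing")[, "start"]
  # NA positions propagate through the comparison, so a missing word gives NA
  as.character(ifelse(win < lose, "wins", "loses"))
}

order_code(c("winning then losing", "losing then winning", "blablabla"))
```

Because str_locate reports a missing word as NA rather than -1, no separate presence check is needed here.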

You could also use `str_locate` from the stringr package to find the positions of the winning and losing terms and see which one is smaller: `ifelse(str_locate(vec[2], "winning") - str_locate(vec[2], "losing") < 0, return("wins"), return("loses"))` — grrgrrbla, Feb 05 '15 at 11:54
@grrgrrbla, yes indeed, thanks for the comment, I'm not "used to use" this package, I'll add the alternative, thanks — Cath, Feb 05 '15 at 11:56
