Consider a df with two columns, word and stem. I want to create a new column that checks whether the value in stem is contained in word, and whether it is preceded or followed by additional characters. The final result should look like this:

WORD     STEM     NEW
rerun    run      prefixed
runner   run      suffixed
run      run      none
...      ...      ...

Below is my code so far. It does not work because the grepl expression is applied to all rows of the df at once instead of row by row. Still, I think it makes my idea clear.

df$new <- ifelse(grepl(paste0('.+', df$stem, '.+'), df$word), 'both',
             ifelse(grepl(paste0(df$stem, '.+'), df$word), 'suffixed',
                ifelse(grepl(paste0('.+', df$stem), df$word), 'prefixed','none')))
Ric S
hyhno01

3 Answers

2

You can use mapply to apply grepl row by row, so that each pattern is matched only against the word in the same row:

ifelse(mapply(grepl, paste0(".+", x$STEM, ".+"), x$WORD), "both",
ifelse(mapply(grepl, paste0(x$STEM, ".+"), x$WORD), "suffixed",
ifelse(mapply(grepl, paste0(".+", x$STEM), x$WORD), "prefixed", "none")))
#"prefixed" "suffixed"     "none" 
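As a self-contained illustration of the row-wise idea: the R documentation for grepl states that only the first element of a longer pattern vector is used (with a warning), which is why the pairing has to be done explicitly. The sample data and the helper name `match_rowwise` below are assumptions made for this sketch, not part of the original answer.

```r
# Hypothetical sample data mirroring the table in the question
x <- data.frame(WORD = c("rerun", "runner", "run", "aruna"),
                STEM = c("run", "run", "run", "run"),
                stringsAsFactors = FALSE)

# Hypothetical helper: match the i-th pattern against the i-th text only.
# unname() drops the names that mapply derives from the pattern vector.
match_rowwise <- function(pattern, text) {
  unname(mapply(grepl, pattern, text))
}

x$NEW <- ifelse(match_rowwise(paste0(".+", x$STEM, ".+"), x$WORD), "both",
         ifelse(match_rowwise(paste0(x$STEM, ".+"), x$WORD), "suffixed",
         ifelse(match_rowwise(paste0(".+", x$STEM), x$WORD), "prefixed", "none")))

print(x)
```

Because the stems are pasted into regular expressions, a stem containing regex metacharacters (for example a dot) would need escaping first.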

Or use startsWith and endsWith to compute an index into a vector of labels:

c("none", "both", "suffixed", "prefixed")[1 + (1 + startsWith(x$WORD, x$STEM) +
 2*endsWith(x$WORD, x$STEM)) * (nchar(x$WORD) > nchar(x$STEM) &
 mapply(grepl, x$STEM, x$WORD))]
#[1] "prefixed" "suffixed" "none"
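The index arithmetic in the one-liner above is compact, so the following sketch unpacks it into named steps. The sample data and the intermediate variable names are assumptions for illustration. The labels are ordered so that a word starting with the stem maps to "suffixed" and a word ending with the stem maps to "prefixed".

```r
# Hypothetical sample data mirroring the table in the question
x <- data.frame(WORD = c("rerun", "runner", "run", "aruna"),
                STEM = "run",
                stringsAsFactors = FALSE)

starts <- startsWith(x$WORD, x$STEM)   # stem at the start: extra characters follow it
ends   <- endsWith(x$WORD, x$STEM)     # stem at the end: extra characters precede it
longer <- nchar(x$WORD) > nchar(x$STEM)
# Literal (fixed = TRUE) containment test, paired row by row
inside <- unname(mapply(grepl, x$STEM, x$WORD, MoreArgs = list(fixed = TRUE)))

# code: 0 = none, 1 = both (stem strictly inside), 2 = suffixed, 3 = prefixed
code <- (1 + starts + 2 * ends) * (longer & inside)

labels <- c("none", "both", "suffixed", "prefixed")
result <- labels[1 + code]
print(result)
```

One limitation of this encoding: a word that both starts and ends with the stem while being longer than it (such as "runrun") yields code 4, which indexes past the end of the label vector and returns NA.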
GKi
1

You can create the new column like this:

df$new <- ifelse(startsWith(df$word, df$stem) & endsWith(df$word, df$stem), 'none',
                 ifelse(startsWith(df$word, df$stem), 'suffixed',
                        ifelse(endsWith(df$word, df$stem), 'prefixed',
                               'both')))

Or, if you are in a dplyr pipeline and want to avoid repeating df$ everywhere:

library(dplyr)

df %>%
  mutate(new = ifelse(startsWith(word, stem) & endsWith(word, stem), 'none',
                      ifelse(startsWith(word, stem), 'suffixed',
                             ifelse(endsWith(word, stem), 'prefixed',
                                    'both'))))

Output

#       word stem      new
# 1    rerun  run prefixed
# 2   runner  run suffixed
# 3      run  run     none
# 4    aruna  run     both
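One thing to watch with the nested ifelse above: a word that does not contain the stem at all neither starts nor ends with it, so it also falls through to 'both'. The sketch below adds a containment guard for that case. The extra "walk" row, the `contains_stem` variable, and the choice of NA as the result are assumptions for illustration, since the question does not say how such rows should be labelled.

```r
# Hypothetical sample data; the last row does not contain the stem
df <- data.frame(word = c("rerun", "runner", "run", "aruna", "walk"),
                 stem = c("run", "run", "run", "run", "run"),
                 stringsAsFactors = FALSE)

# TRUE where the stem occurs literally (fixed = TRUE) in the word of the same row
contains_stem <- unname(mapply(grepl, df$stem, df$word,
                               MoreArgs = list(fixed = TRUE)))

starts <- startsWith(df$word, df$stem)
ends   <- endsWith(df$word, df$stem)

# Rows without the stem get NA instead of falling through to "both"
df$new <- ifelse(!contains_stem, NA_character_,
          ifelse(starts & ends, "none",
          ifelse(starts, "suffixed",
          ifelse(ends, "prefixed", "both"))))

print(df)
```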
Ric S
  • Thank you for the fast response. I selected this answer as the solution because it is the most similar to my approach. That said, Ian Campbell's answer also solves the problem. – hyhno01 Jun 08 '20 at 15:25
  • @hyhno01 Just to let you know, I've updated my answer: I removed the part in which I compared the `nchar` of word and stem because I realised it was superfluous. – Ric S Jun 08 '20 at 16:08
1

Here's an approach with str_locate from stringr and dplyr:

library(dplyr)
library(stringr)
data %>%
  mutate_at(vars(WORD,STEM), as.character) %>%
  mutate(NEW = 
         case_when(str_locate(WORD,STEM)[,"start"] > 1 &
                   str_locate(WORD,STEM)[,"end"] < nchar(WORD) ~ "both",
                   str_locate(WORD,STEM)[,"start"] > 1 ~ "prefixed",
                   str_locate(WORD,STEM)[,"end"] < nchar(WORD) ~ "suffixed",
                   TRUE ~ "none"))
    WORD STEM      NEW
1  rerun  run prefixed
2 runner  run suffixed
3    run  run     none

I added a line to convert WORD and STEM to character in case they were factors.
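For readers without stringr, the same start/end position idea can be sketched in base R, with regexpr standing in for str_locate. This is a swapped-in technique rather than the code above, and the sample data and variable names are assumptions for illustration. Unlike str_locate, the sketch matches the stem literally (fixed = TRUE) and pairs stems with words explicitly through mapply.

```r
# Hypothetical sample data mirroring the table in the question
data <- data.frame(WORD = c("rerun", "runner", "run", "aruna"),
                   STEM = "run",
                   stringsAsFactors = FALSE)

# Start of the first literal match of each stem in its own word (-1 if absent)
start <- unname(mapply(
  function(stem, word) as.integer(regexpr(stem, word, fixed = TRUE)),
  data$STEM, data$WORD
))
end <- start + nchar(data$STEM) - 1L

has_prefix <- start > 1                          # characters before the stem
has_suffix <- start > 0 & end < nchar(data$WORD) # characters after the stem

data$NEW <- ifelse(has_prefix & has_suffix, "both",
            ifelse(has_prefix, "prefixed",
            ifelse(has_suffix, "suffixed", "none")))

print(data)
```

Computing the match position once and reusing it also avoids repeating the same search in every condition.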

Ian Campbell