I am trying to split sentences based on different criteria. I am looking to split some sentences after " is" and some after " never". I was able to split sentences based on either of these conditions but not both.

str <- matrix(c("This is line one", "This is not line one", 
                "This can never be line one"), nrow = 3, ncol = 1)

>str
     [,1]                        
[1,] "This is line one"          
[2,] "This is not line one"      
[3,] "This can never be line one"

str2 <- apply(str, 1, function (x) strsplit(x, " is", fixed = TRUE))

> str2
[[1]]
[[1]][[1]]
[1] "This"      " line one"


[[2]]
[[2]][[1]]
[1] "This"          " not line one"


[[3]]
[[3]][[1]]
[1] "This can never be line one"

I would like to split the last sentence after " never". I am not sure how to do that.

RDPD
    FYI `strsplit` is vectorized. No need for `apply` – Sotos Sep 05 '16 at 06:30
  • Maybe `strsplit(x, " is | never ")`? – zx8754 Sep 05 '16 at 06:32
  • @akrun again all I am saying it is a *Possible* duplicate, basically both questions want to use OR operator in regex. Also, it is good to have related posts linked. – zx8754 Sep 05 '16 at 06:39
    @akrun post is not even tagged with regex, "is" and "never" are fixed words. We obviously have different thresholds to accept a post as a dupe, let's leave it at that. – zx8754 Sep 05 '16 at 06:44
  • Sorry, the linked one for dupe is not related to this. So, reopening it. – akrun Sep 05 '16 at 08:19
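
The first comment's point about vectorization can be illustrated with a small sketch. It assumes the sentences are held in a plain character vector `x` (a name introduced here, not in the question) and compares a direct `strsplit` call with the `apply` version:

```r
x <- c("This is line one", "This is not line one",
       "This can never be line one")

# strsplit() accepts a whole character vector and returns
# one list element per input string
flat <- strsplit(x, " is", fixed = TRUE)

# apply() over the rows of a one-column matrix calls strsplit() per row,
# so every result is wrapped in an extra list level
nested <- apply(matrix(x, ncol = 1), 1,
                function(s) strsplit(s, " is", fixed = TRUE))

print(flat[[1]])
print(nested[[1]][[1]])
```

Both calls find the same pieces; the direct call simply avoids the `[[i]][[1]]` nesting seen in `str2`.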

1 Answer

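We can use regex lookarounds to split each line at the whitespace that follows 'is' or 'never'. In the pattern below, (?<=\\bis)\\s+ matches one or more whitespace characters (\\s+) that come directly after the whole word 'is'; the \\b word boundary keeps the 'is' at the end of 'This' from matching. After the |, (?<=\\bnever)\\s+ does the same for the word 'never'. Lookbehind requires perl = TRUE.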

strsplit(str[,1], "(?<=\\bis)\\s+|(?<=\\bnever)\\s+", perl = TRUE)
#[[1]]
#[1] "This is"  "line one"

#[[2]]
#[1] "This is"      "not line one"

#[[3]]
#[1] "This can never" "be line one"   
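
To see what the \\b in the lookbehind contributes, here is a small sketch that runs the same split with and without it on a single sentence. The variable names `x`, `no_boundary` and `with_boundary` are made up for this illustration:

```r
x <- "This is line one"

# Without \\b, the "is" at the end of "This" also satisfies the lookbehind,
# so the string is split after "This" as well as after the word "is"
no_boundary <- strsplit(x, "(?<=is)\\s+", perl = TRUE)[[1]]

# With \\b, only "is" starting at a word boundary qualifies
with_boundary <- strsplit(x, "(?<=\\bis)\\s+", perl = TRUE)[[1]]

print(no_boundary)
#[1] "This"     "is"       "line one"
print(with_boundary)
#[1] "This is"  "line one"
```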

To drop 'is' and 'never' from the result as well, match the words themselves together with the surrounding whitespace:

strsplit(str[,1], "(?:\\s+(is|never)\\s+)")
#[[1]]
#[1] "This"     "line one"

#[[2]]
#[1] "This"         "not line one"

#[[3]]
#[1] "This can"    "be line one"
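
If more split words are likely to be added later, the lookbehind pattern could be assembled from a vector of words instead of being written out by hand. This is only a sketch: `split_after_words` is a hypothetical helper, not a base R function, and it assumes the words are plain letters with no regex metacharacters.

```r
# Hypothetical helper: split each string at the whitespace that follows
# any of the given whole words, keeping the word in the left-hand piece.
# Assumes `words` contains no regex metacharacters.
split_after_words <- function(x, words) {
  # Builds e.g. "(?<=\\bis)\\s+|(?<=\\bnever)\\s+"
  pattern <- paste0("(?<=\\b", words, ")\\s+", collapse = "|")
  strsplit(x, pattern, perl = TRUE)
}

sentences <- c("This is line one", "This is not line one",
               "This can never be line one")
result <- split_after_words(sentences, c("is", "never"))
print(result)
```

For `c("is", "never")` the generated pattern is the same one used in the first call above, so the result should match that output.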
akrun