I would like to split the elements of a character vector, text, into sentences. There is more than one splitting pattern ("and/ERT", "/$"), and there are also exceptions to those patterns (":/$.", "and/ERT then", "./$. Smiley").

My attempt: match the places where a split should occur, insert an unusual marker ("^&*") at each of them, then strsplit on that marker.

Problem: I don't know how to handle the exceptions properly. There are explicit cases where the marker ("^&*") should be removed and the original text restored before running strsplit.

Code:

text <- c("This are faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"This are the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"Like above the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!")

patternSplit <- c("and/ERT", "/\\$") # The class of split cases is much larger than in this example, so it is not possible to address them all explicitly.
patternSplit <- paste("(", paste(patternSplit, collapse = "|"), ")", sep = "")

exceptionsSplit <- c("\\:/\\$\\.", "and/ERT then", "\\./\\$\\. Smiley")
exceptionsSplit <- paste("(", paste(exceptionsSplit, collapse = "|"), ")", sep = "")

# Without exceptions, this works. Unfortunately it splits "*$/*" into "*" and "$/*". It would be convenient to avoid this. See the "ideal" split below.
textsplitted <- strsplit(gsub(patternSplit, "^&*\\1", text), "^&*", fixed = TRUE)

# Ideal split:
> textsplitted
[[1]]
 [1] "This are faulty propositions one and/ERT" 
 [2] "two ,/$," 
 [3] "which I want to split ./$."
 [4] "There are cases where I explicitly want and/ERT" 
 [5] "some where I don't want to split ./$." 
 [6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
 [7] "This is also one case where I dont't want to split ./$. Smiley !/$." 
 [8] "Thank you ./$!"

[[2]]
 [1] "This are the same faulty propositions one and/ERT"
 [2] "two ,/$,"
#...      

# This attempt doesn't work!
text <- gsub(patternSplit, "^&*\\1", text)
text <- gsub(exceptionsSplit, '[original text without "^&*"]', text) # placeholder replacement, not working code
textsplitted <- strsplit(text, "^&*", fixed = TRUE)
alex
  • Do you want to split on `"/$"` or `",/$,"` as per your output? – Simon O'Hanlon Sep 09 '13 at 12:04
  • @SimonO101 thank you! I want to split at every `"/$"` and `"and\\/ERT"`, taking into account the exceptions `":/$."`, `"and/ERT then"`, `"./$. Smiley"`. See also the comment at line #5. – alex Sep 09 '13 at 12:09
  • You don't need to escape `/`; it's not a metacharacter in EREs. If you match `/\\$` you can't get a piece that ends in `,/$,`: it will end in `,/$` and the next piece will start with `,`. Also, in `strsplit` the characters matched by the split pattern are consumed, so they disappear. I am working on a regex, but you need to be really explicit about how you want to split the string! – Simon O'Hanlon Sep 09 '13 at 12:12
  • @SimonO101: thanks. `/`-escape corrected. Do you know an alternative to `strsplit` that avoids the workaround with the 'special pattern', ideally one where you can pass in the split and exception cases directly? The result should keep the whole matched 'word' (between spaces) in the first 'proposition', then split and begin a new 'line'/element. – alex Sep 09 '13 at 12:27
  • Well, I think you can do it using `regexec` and `regmatches`, but an alternative could be to match the space after the thing you want to split on using zero-width assertions. I think I have a working one to post. – Simon O'Hanlon Sep 09 '13 at 12:28
  • Why is there a split between items 7 and 8? – Simon O'Hanlon Sep 09 '13 at 12:52
  • @SimonO101: that was a mistake from copy-pasting. – alex Sep 09 '13 at 13:01

1 Answer

I think you can use this expression to get the splits you want. Because strsplit consumes the characters it splits on, you have to split on the space that follows the thing you want to match (or not match), which is what the desired output in your question shows:

strsplit( text[[1]] , "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"  , perl = TRUE )
#[[1]]
#[1] "This are faulty propositions one and/ERT"                                 
#[2] "two ,/$,"                                                                 
#[3] "which I want to split ./$."                                               
#[4] "There are cases where I explicitly want and/ERT"                          
#[5] "some where I don't want to split ./$."                                    
#[6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
#[7] "This is also one case where I dont't want to split ./$. Smiley !/$."      
#[8] "Thank you ./$!" 
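
The call above is shown for text[[1]] only. Since strsplit is vectorized over its first argument, the same pattern can be applied to a whole character vector in one go. The following is a minimal sketch under that assumption; the shortened input `txt` and the names `pat` and `res` are made up for illustration and are not part of the original question.

```r
# Hypothetical shortened input; the tags mirror those used in the question.
txt <- c(
  "one and/ERT two ,/$, three ./$. For example :/$. an and/ERT then no split ./$. Smiley !/$. Thank you ./$!",
  "a and/ERT b ./$. c"
)

# Same pattern as in the answer, stored in a variable for reuse.
pat <- "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"

# strsplit returns a list with one character vector per input element.
res <- strsplit(txt, pat, perl = TRUE)

# Number of pieces per input element.
print(lengths(res))
```

Each list element can then be inspected or post-processed separately, for example with lapply.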

Explanation

  • (?<=and/ERT)\\s - split on a space, \\s, that IS preceded, (?<=...), by "and/ERT"
  • (?!then) - BUT only if that space is NOT followed, (?!...), by "then"
  • | - OR operator to chain the next expression
  • (?<=/\\$[[:punct:]]) - positive look-behind assertion for "/$" followed by any punctuation character
  • (?<!:/\\$[[:punct:]])\\s(?!Smiley) - match a space that is NOT preceded by ":/$" plus a punctuation character (but, according to the previous point, IS preceded by "/$" plus a punctuation character) and is NOT followed, (?!...), by "Smiley"
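
Building on the explanation above, the negative look-ahead can hold several alternatives at once, which is one way to handle more than one "do not split before this word" exception. The sketch below assembles such a pattern from a character vector. The variable names and the test string are hypothetical, and the approach assumes the exception words contain no regex metacharacters.

```r
# Hypothetical list of words that must not directly follow a split point.
no_split_before <- c("Smiley", "Vampire", "Alien")

# Join the words into one alternation inside a single negative look-ahead.
# Assumption: the words contain no regex metacharacters.
lookahead <- paste0("(?!", paste(no_split_before, collapse = "|"), ")")

# Same structure as the pattern in the answer, with the look-ahead generalized.
pattern <- paste0(
  "(?<=and/ERT)\\s(?!then)|",
  "(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s",
  lookahead
)

# Made-up test string, not taken from the question.
x <- "no split ./$. Vampire !/$. next ./$. Alien ok ./$. end"
pieces <- strsplit(x, pattern, perl = TRUE)[[1]]
print(pieces)
```

Writing several look-aheads one after another, such as (?!Smiley)(?!Vampire)(?!Alien), should behave the same way, because all of them are tested at the same position; the single alternation is just easier to generate from a vector.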
Simon O'Hanlon
  • If you find a moment, I would be very thankful if you could briefly explain what the elements are doing. `?` I think is for the one position before the match; `<` ; `=` ; `\\s` ; `(?!then)` ; `|` another case, logical OR; `[[:punct:]]` punctuation characters. – alex Sep 09 '13 at 13:17
  • Ok - give me a few minutes to type it up. :-) Glad it worked! – Simon O'Hanlon Sep 09 '13 at 13:23
  • @SimonO101, +1. Awesome stuff, and nice that you take the time to explain in such detail. – A5C1D2H2I1M1N2O1R2T1 Sep 09 '13 at 15:02
  • @SimonO101: one more question (: is it possible to write something like `(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)(?!Vampire)(?!Alien)` in order to also match cases where `"Vampire"` or `"Alien"` follows the regular expression exception instead of `"Smiley"`? Like: `"This is also one case where I dont't want to split ./$. Vampire !/$."` , `"This is also one case where I dont't want to split ./$. Alien !/$."` – alex Sep 09 '13 at 15:34
  • @SimonO101: How would you split at every `and/ERT` only when it is not succeeded by `"/V"` inside **one word** after it, in `faulty and/ERT something/VBN and/ERT else/VHGB and/ERT as/VVFIN propositions one and/ERT two/CDF and/ERT three/ABC `? – alex Sep 10 '13 at 12:43
  • @alex come on! This seems like it should be a separate question, it's getting fairly involved again! It's bad form to keep asking additional questions in the comments. Post a new question then everyone can tackle it. Thanks. – Simon O'Hanlon Sep 10 '13 at 12:45
  • @SimonO101: sorry. have a nice day. – alex Sep 10 '13 at 12:47
  • @SimonO101 done :) Thank you! [link](http://stackoverflow.com/questions/18719809/r-split-only-when-special-regex-condition-doesnt-match/18720051?noredirect=1#comment27585919_18720051) – alex Sep 10 '13 at 14:38
  • @alex great! And see you also got upvotes for the new question and insight from other more capable regexers! Good stuff. – Simon O'Hanlon Sep 10 '13 at 14:39