-3

I want to split a string after every "a", provided that the "a" is not followed by "b".

string <- "abcgualoo87ahhabta"

The expected output is:

string <- [1]abcgua
[2]loo87a
[3]hhabta
  • Look for a regex tutorial that explains negative lookahead. – Roland Dec 16 '15 at 09:00
  • Why is the last `a` split off? That is inconsistent with elements 1-3, where the splitting `a` is the last character (and not the first). Try `strsplit(string,"(?<=a)(?!b)",perl=TRUE)` and listen to @Roland's advice. – nicola Dec 16 '15 at 09:03
  • @nicola, I added your line to my answer to make it more visible. If you want to post it as a separate answer, let me know so I can delete your line (or feel free to edit my answer to remove the last part). – Cath Dec 16 '15 at 09:56
  • @CathG No problem at all, keep the line in your answer. – nicola Dec 16 '15 at 10:27
  • @nicola I corrected the last split on `a`. – DHWANI DHOLAKIA Dec 16 '15 at 11:46

2 Answers

6

You can split your string on the pattern "a followed by any character other than b" with the regex a(?=[^b]) in strsplit:

split_str <- strsplit("abcgualoo87ahhabta", "a(?=[^b])", perl=TRUE)[[1]]
split_str
#[1] "abcgu"  "loo87"  "hhabta"

Explanation of the split pattern: a lookahead ((?=...)) is used, and the pattern it looks ahead for is any single character except b ([^b], where the ^ sign indicates negation). The lookahead checks the next character without consuming it, so only the "a" is removed by the split. For the lookahead to be interpreted, the parameter perl must be set to TRUE.
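To see which occurrences of "a" the pattern selects, it can help to list the match positions directly. The following is only a small sketch using base R's gregexpr; the variable names s, pos_positive and pos_negative are illustrative and not part of the original answer. It also contrasts the positive lookahead used above with a negative lookahead, a(?!b), which is a different pattern: (?=[^b]) needs an actual character after the "a", while (?!b) is also satisfied at the end of the string.

```r
s <- "abcgualoo87ahhabta"

# Positions of "a" followed by some character other than "b".
# The lookahead needs one character to inspect, so the final "a" is skipped.
pos_positive <- as.integer(gregexpr("a(?=[^b])", s, perl = TRUE)[[1]])
pos_positive
# [1]  6 12

# Positions of "a" not followed by "b"; the end of the string also qualifies.
pos_negative <- as.integer(gregexpr("a(?!b)", s, perl = TRUE)[[1]])
pos_negative
# [1]  6 12 18
```

This is why the last piece keeps its trailing "a" with a(?=[^b]); splitting on a(?!b) instead would presumably remove that final "a" as well.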

Then you can add the removed "a" back to the end of each split part, except the last:

split_str <- paste0(split_str, c(rep("a", length(split_str)-1), ""))
split_str
#[1] "abcgua" "loo87a" "hhabta"

A nice one-step alternative provided by @nicola in the comments:

split_str <- strsplit("abcgualoo87ahhabta","(?<=a)(?!b)", perl=TRUE)[[1]]
#[1] "abcgua" "loo87a" "hhabta"
Cath
  • I would really encourage you to use `TRUE` and `FALSE` instead of `T` and `F`. –  Dec 16 '15 at 10:16
  • @CathG It would be great if you could explain the meaning of a(?=[^b]). Also, if I add another condition, splitting the string on "a" or "g" not followed by "b", how should I proceed in that scenario? – DHWANI DHOLAKIA Dec 16 '15 at 12:04
  • @DHWANIDHOLAKIA I edited with an explanation, let me know if it's ok or if you'd like more details – Cath Dec 16 '15 at 12:31
  • @CathG The above explanation was very informative. For the next condition, where I want to split the string on "a|g" not followed by "b", I changed the above code to **strsplit("abcgualoo87ahhabta", "a|g(?=[^b])", perl=TRUE)[[1]]**, however I got very different results: [1] "" "bc" "u" "loo87" "hh" "bt" – DHWANI DHOLAKIA Dec 16 '15 at 12:55
  • @DHWANIDHOLAKIA you need to put `(a|g)` in brackets, else you're asking for a or (g not followed by b), so it splits at every a – Cath Dec 16 '15 at 12:57
  • @CathG It worked after putting in the brackets. But when there was only one condition to split on, we could add "a" back as shown in your second statement **strsplit("abcgualoo87ahhabta", "a|g(?=[^b])", perl=TRUE)[[1]]**; in this case it would be complicated, as we don't know whether to put "a" or "g" – DHWANI DHOLAKIA Dec 16 '15 at 13:02
  • @CathG I ran the changed condition using your second, one-line answer and it worked. – DHWANI DHOLAKIA Dec 16 '15 at 13:41
  • @CathG Can you tell me when to use parentheses () and when to use [ ] in such situations? – DHWANI DHOLAKIA Dec 16 '15 at 13:42
  • @DHWANIDHOLAKIA in regex, [ and ( do not have the same meaning (you can have a look at `?regex` to learn more), but actually `[ag](?=[^b])` instead of `(a|g)(?=[^b])` works too – Cath Dec 16 '15 at 13:45
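The grouping issue discussed in these comments can be checked side by side. This is a minimal sketch, assuming the same input string; the variable names are made up for illustration, and the last pattern is the one-line lookbehind approach from the answer adapted to "a or g", not code posted in the thread.

```r
s <- "abcgualoo87ahhabta"

# Without grouping, the alternation reads as: "a", or ("g" not followed by "b"),
# so every "a" becomes a split point.
ungrouped <- strsplit(s, "a|g(?=[^b])", perl = TRUE)[[1]]

# A group and a character class both apply the lookahead to "a" and "g" alike.
grouped  <- strsplit(s, "(a|g)(?=[^b])", perl = TRUE)[[1]]
char_cls <- strsplit(s, "[ag](?=[^b])", perl = TRUE)[[1]]
identical(grouped, char_cls)
# [1] TRUE

# Zero-width variant: split after "a" or "g" unless "b" follows.
# Nothing is consumed, so each delimiter stays attached to the preceding piece.
keep_delim <- strsplit(s, "(?<=[ag])(?!b)", perl = TRUE)[[1]]
keep_delim
# [1] "abcg"   "ua"     "loo87a" "hhabta"
```

Because the zero-width pattern removes no characters, there is no need to work out afterwards whether an "a" or a "g" has to be pasted back.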
2
string <- "abcgualoo87ahhabta"
unlist(strsplit(gsub("a([^b])", "a \\1", string), split=" "))
# [1] "abcgua" "loo87a" "hhabta"
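Note that this approach relies on the input containing no spaces, and that "a([^b])" consumes the character after the "a". A possible variant, sketched here under the assumption that a newline never occurs in the input (the names sep, marked and pieces are illustrative), uses a negative lookahead so that no back-reference is needed:

```r
string <- "abcgualoo87ahhabta"

# Marker inserted after each qualifying "a"; assumed not to occur in the input.
sep <- "\n"

# The lookahead does not consume the next character, so no back-reference
# is required to restore it.
marked <- gsub("a(?!b)", paste0("a", sep), string, perl = TRUE)

# Split on the marker as a fixed string; a marker at the very end
# does not produce an empty trailing element.
pieces <- strsplit(marked, sep, fixed = TRUE)[[1]]
pieces
# [1] "abcgua" "loo87a" "hhabta"
```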
Ven Yao