
I am having a problem with regex and strsplit. I would like to split the following string x at the second : symbol

x <- "26/11/19, 22:16 - Super Mario: It's a me: Super Mario!, but also : the princess"

and then obtain something like this

"26/11/19, 22:16 - Super Mario"
" It's a me: Super Mario!, but also : the princess"

I am using strsplit with the following regular expression which, based on my limited know-how, should read as "select ONLY the colon symbol followed by a space and preceded by ONLY letters".

I tried to make the regex non-greedy with the ? symbol, but clearly I am missing something: the result is not what I expected, because the split also happens at me:.

I think a non-greedy operator is essential, because the string here is just an example; of course I do not always have the word Mario.

strsplit(x, "(?<=[[:alpha:]]):(?= )", perl = TRUE)

Thank you in advance!

SabDeM
  • I am confused. The colon in `Mario: ` is the first, not second, colon followed by a space and preceded by a letter. Please be more precise in stating your requirements. – Cary Swoveland May 08 '20 at 21:47
  • do you always have a time stamp? `strsplit(x, '\\d.*?:.*?:\\K', perl = TRUE)` – rawr May 08 '20 at 21:48
  • Do you mean split on the first colon that is followed by a space and preceded by a letter? – Cary Swoveland May 08 '20 at 21:49
  • @rawr I think you got it! Please add it as an answer... and, if I may ask, could you enlighten me on the regex? I get (almost) all of it. Thank you very much! – SabDeM May 08 '20 at 21:51
  • @SabDeM i wouldn't trust it, it could break in some examples. it matches a digit followed by two colons then resets, so if there is a digit after the second colon, it won't work. without the `\\d`, it will split after every two colons. I don't know enough regex to fix that – rawr May 08 '20 at 21:57
  • @rawr thank you very much you gave me some very good ideas. Appreciate it! – SabDeM May 08 '20 at 22:01
  • @CarySwoveland I want to split based on the second colon. – SabDeM May 08 '20 at 22:47
  • That means that for the strings `a:b:c: d` and `a: b:c: d` you wish to split on the colon between `b` and `c`. Correct? If so, what is the purpose of the italicized clause in your question? – Cary Swoveland May 08 '20 at 22:56
  • Correct; no purpose at all, it was just "quoting" my thoughts... I do not get your point. – SabDeM May 08 '20 at 23:02
  • If you look at @akrun's `str_split` answer you will see that he/she matched on a colon preceded by a letter (`(?<=[[:alpha:]])`) and followed by a space (`(?= )`). We both understood that you are only interested in splitting on a colon that is preceded by a letter and followed by a space. I suggest you remove the italicized clause. – Cary Swoveland May 08 '20 at 23:18

1 Answer


We can replace the first ':' that follows a letter with another character, or simply double it, and then use strsplit on the doubled colon

strsplit(sub("([[:alpha:]]):", "\\1::", x),
       "(?<=[[:alpha:]]):{2,}(?= )", perl = TRUE)[[1]]
#[1] "26/11/19, 22:16 - Super Mario"       
#[2] " It's a me: Super Mario!, but also : the princess"

Or with str_split, where n = 2 caps the result at two pieces so that only the first match is used

library(stringr)
str_split(x, "(?<=[[:alpha:]]):(?= )", n = 2)[[1]]
#[1] "26/11/19, 22:16 - Super Mario"   
#[2] " It's a me: Super Mario!, but also : the princess"
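
A base-R sketch of the same "split at the first match only" idea, not part of the original answer: `regexpr()` reports only the first match, so `regmatches()` with `invert = TRUE` returns the text before and after that single colon. The names `first_colon` and `parts` are illustrative.

```r
x <- "26/11/19, 22:16 - Super Mario: It's a me: Super Mario!, but also : the princess"

# regexpr() locates only the first colon that is preceded by a letter
# and followed by a space (the colon in "22:16" follows a digit, so it is skipped)
first_colon <- regexpr("(?<=[[:alpha:]]):(?= )", x, perl = TRUE)

# invert = TRUE returns the pieces around the match instead of the match itself
parts <- regmatches(x, first_colon, invert = TRUE)[[1]]
parts
```

This avoids both the extra package and the intermediate `sub()` step, at the cost of being a little less obvious to read.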
akrun
  • Thank you, but I do not always have Mario; that's why I decided to use `[[:alpha:]]`. The string here is just an example. – SabDeM May 08 '20 at 21:17