R regex: split a string by combination of \\n [A-z] & [:punct:]

Question

I have a dataframe with character strings that look like this:

bla bla.\n14:39:51 info: pyku bla .\n14:39:51 info: \n14:39:51 info: \n14:39:57 Sam: <span>pyk pyk</span>\n14:43:15 on and on \n14:43:59 you get an idea

I want to split each string into separate rows wherever a \n is followed by a (number):(number):(number) sequence. I tried

stringr::separate_rows(df3$Transcript[1], Transcript , sep = "\\n")

and various combinations with [A-z] and [:punct:], to no avail. What would be the most straightforward way of doing it?

Thanks

Just to be clear: `strsplit(s, "\n")` does not work for you, does it? If not use `strsplit(s, "\\R+(?=\\d{2}:\\d{2}:\\d{2})", perl=TRUE)` — Wiktor Stribiżew, Oct 12 '17 at 08:47
@AvinashRaj: thank you, that worked brilliantly! If you post it as the answer, I'll happily accept it, thanks! — Kasia Kulma, Oct 12 '17 at 08:51
@KasiaKulma Wiktor will write more beautiful answer than me :-) — Avinash Raj, Oct 12 '17 at 08:53
@KasiaKulma You do not need to use `stringr` for that, you may use a base R `strsplit`, see my top comment. Besides, PCRE regex contains a very nice `\R` construct to match *any* line breaks. *stringr* is based on ICU regex library and it has no `\R` support. Does `strsplit(s, "\\R+(?=\\d{2}:\\d{2}:\\d{2})", perl=TRUE)` work for you, too? — Wiktor Stribiżew, Oct 12 '17 at 08:54
@WiktorStribiżew: your solution produces exactly the same answer as AvinashRaj's, I'm not sure what's the advantage..? — Kasia Kulma, Oct 12 '17 at 09:01

score 2 · Accepted Answer · answered Oct 12 '17 at 09:04

You want to split the strings at a line break that is followed by a timestamp. You may use the base R strsplit function with a PCRE regex based on a positive lookahead:

strsplit(s, "\\R+(?=\\d{2}:\\d{2}:\\d{2})", perl=TRUE)

Pattern details

\R+ - 1 or more line break sequences (for example \n, \r or \r\n)
(?=\d{2}:\d{2}:\d{2}) - followed by 2 digits, :, 2 digits, : and again 2 digits. Since (?=...) is a positive lookahead (a zero-width assertion that does not add the matched chars to the match value), the timestamp it matches is not consumed by the split and stays at the start of the next piece.
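To see what the lookahead adds over a plain newline split, here is a minimal sketch. The input string `x` is made up for illustration (it is not from the question): it has one line break inside an entry and a Windows-style `\r\n` before a timestamp.

```r
# Pattern from the answer: line break(s) only when an HH:MM:SS timestamp follows.
pattern <- "\\R+(?=\\d{2}:\\d{2}:\\d{2})"

# Hypothetical input: the first \n sits inside an entry,
# the \r\n comes right before a timestamp.
x <- "first line\nstill the same entry\r\n09:15:00 next entry"

# Splitting on every literal \n breaks the first entry in two
# and leaves a stray \r at the end of the second piece.
by_newline <- strsplit(x, "\n", fixed = TRUE)[[1]]

# Splitting with the lookahead only breaks where a timestamp follows,
# and \R consumes the whole \r\n pair.
by_timestamp <- strsplit(x, pattern, perl = TRUE)[[1]]

length(by_newline)    # 3 pieces
length(by_timestamp)  # 2 pieces
```

If every line break in the data is a plain `\n` that starts a new timestamped entry, both approaches should give the same pieces; the pattern only matters when that assumption does not hold.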

R demo:

s <- "bla bla.\n14:39:51 info: pyku bla .\n14:39:51 info: \n14:39:51 info: \n14:39:57 Sam: <span>pyk pyk</span>\n14:43:15 on and on \n14:43:59 you get an idea"
strsplit(s, "\\R+(?=\\d{2}:\\d{2}:\\d{2})", perl=TRUE)

Output:

[[1]]
[1] "bla bla."                           "14:39:51 info: pyku bla ."         
[3] "14:39:51 info: "                    "14:39:51 info: "                   
[5] "14:39:57 Sam: <span>pyk pyk</span>" "14:43:15 on and on "               
[7] "14:43:59 you get an idea"
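The question asks for the pieces as separate rows of a data frame, while strsplit returns a list. One possible base R sketch is shown here; the `id` column and the sample values in `df3` are assumptions for illustration, and only the `Transcript` column name comes from the question.

```r
pattern <- "\\R+(?=\\d{2}:\\d{2}:\\d{2})"

# Hypothetical two-row data frame; `id` is an invented key column.
df3 <- data.frame(
  id = 1:2,
  Transcript = c("bla bla.\n14:39:51 info: pyku bla .\n14:39:57 Sam: hi",
                 "intro\n10:00:00 only line"),
  stringsAsFactors = FALSE
)

# strsplit is vectorised: one character vector of pieces per input row.
pieces <- strsplit(df3$Transcript, pattern, perl = TRUE)

# Repeat each id once per piece, then flatten the pieces into one column.
long <- data.frame(
  id = rep(df3$id, lengths(pieces)),
  Transcript = unlist(pieces),
  stringsAsFactors = FALSE
)

long
```

Each original row contributes as many rows to `long` as it has pieces, and `id` records which transcript a piece came from.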
