I have a dataframe with character strings that look like this:

bla bla.\n14:39:51 info: pyku bla .\n14:39:51 info: \n14:39:51 info: \n14:39:57 Sam: <span>pyk pyk</span>\n14:43:15 on and on \n14:43:59 you get an idea

I want to split each string into separate rows wherever a \n is followed by a (number):(number):(number) sequence. I tried

stringr::separate_rows(df3$Transcript[1], Transcript , sep = "\\n")

and various combinations of it with [A-z] and [:punct:], to no avail. What would be the most straightforward way of doing it?

Thanks

Kasia Kulma

1 Answer

You want to split the strings at a line break that is followed by a timestamp. You may use the base R strsplit function with a PCRE regex based on a positive lookahead:

strsplit(s, "\\R+(?=\\d{2}:\\d{2}:\\d{2})", perl=TRUE)

See the regex demo

Pattern details

  • \R+ - 1 or more line break sequences (such as \n, \r or \r\n)
  • (?=\d{2}:\d{2}:\d{2}) - followed by 2 digits, :, 2 digits, : and again 2 digits. Since (?=...) is a positive lookahead (a zero-width assertion that does not add the matched chars to the match value), the text it matches is not removed from the results.
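To see why the lookahead matters, here is a small sketch comparing the same pattern with and without it. The sample string and the variable names are made up for illustration only.

```r
# Illustrative input, not taken from the question
s <- "first\n14:39:51 second\n14:39:57 third"

# With the lookahead the timestamp is only asserted, so it stays in the pieces
with_lookahead <- strsplit(s, "\\R+(?=\\d{2}:\\d{2}:\\d{2})", perl = TRUE)[[1]]

# Without it the timestamp is part of the match and is consumed by the split
without_lookahead <- strsplit(s, "\\R+\\d{2}:\\d{2}:\\d{2}", perl = TRUE)[[1]]

print(with_lookahead)
# [1] "first"           "14:39:51 second" "14:39:57 third"
print(without_lookahead)
# [1] "first"   " second" " third"
```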

R demo:

s <- "bla bla.\n14:39:51 info: pyku bla .\n14:39:51 info: \n14:39:51 info: \n14:39:57 Sam: <span>pyk pyk</span>\n14:43:15 on and on \n14:43:59 you get an idea"
strsplit(s, "\\R+(?=\\d{2}:\\d{2}:\\d{2})", perl=TRUE)

Output:

[[1]]
[1] "bla bla."                           "14:39:51 info: pyku bla ."         
[3] "14:39:51 info: "                    "14:39:51 info: "                   
[5] "14:39:57 Sam: <span>pyk pyk</span>" "14:43:15 on and on "               
[7] "14:43:59 you get an idea"          
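Since the question asks for rows of a data frame, the same split can probably be expanded into one row per piece with base R alone. This is only a sketch: the id column and the sample values are assumptions, and only df3 and Transcript come from the question.

```r
# Hypothetical data frame; `id` and the sample text are assumed for illustration
df3 <- data.frame(
  id = 1:2,
  Transcript = c("bla bla.\n14:39:51 info: pyku bla .\n14:39:57 Sam: hi",
                 "single line"),
  stringsAsFactors = FALSE
)

# Split every transcript at line breaks that are followed by a timestamp
parts <- strsplit(df3$Transcript, "\\R+(?=\\d{2}:\\d{2}:\\d{2})", perl = TRUE)

# Repeat each id once per piece and flatten the pieces into a single column
long <- data.frame(
  id = rep(df3$id, lengths(parts)),
  Transcript = unlist(parts),
  stringsAsFactors = FALSE
)

print(long)
```

Note that strsplit returns character(0) for an empty string, so a row whose Transcript is "" would be dropped by this approach.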
Wiktor Stribiżew