I'm trying to split a character vector containing messages right in front of each date-time indicator.
I was thinking about using strsplit() with a regular expression and perl = TRUE.
Here's some example data:
TEST <- c("05.10.17, 09:26 - Person One: How about we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n05.10.17, 09:27 - Person Two: ???\n05.10.17, 09:28 - Person Two: You guys have history?\n05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n")
This is what I tried so far:
Cut <- unlist(strsplit(TEST,"(?=[0-3][0-9][.][0-9]{2}[.][0-9]{2}[,][ ][0-9]{2}:[0-9]{2})", perl = TRUE))
Cut
According to this website, the regex should split the string right in front of each date-time indicator. However, the result I get looks like this, with the first character of every message split off into its own element:
[1] "0"
[2] "5.10.17, 09:26 - Person One: How about we chill on sunday\n"
[3] "0"
[4] "5.10.17, 09:27 - Person One: I could bring some beer\n"
[5] "0"
[6] "5.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"
[7] "0"
[8] "5.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"
[9] "0"
[10] "5.10.17, 09:27 - Person Two: ???\n"
[11] "0"
[12] "5.10.17, 09:28 - Person Two: You guys have history?\n"
[13] "0"
[14] "5.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
This is what the result should look like:
[1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"
[2] "05.10.17, 09:27 - Person One: I could bring some beer\n"
[3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"
[4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"
[5] "05.10.17, 09:27 - Person Two: ???\n"
[6] "05.10.17, 09:28 - Person Two: You guys have history?\n"
[7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
Note: I can't simply split the data at every newline, because some of the messages contain one or more newlines in the middle of the message.
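For reference, here is a minimal sketch of one possible workaround, not necessarily the best one. It assumes that every message starts with a timestamp of the form dd.mm.yy, hh:mm and that each timestamp after the first is directly preceded by a newline. The idea is to let the match consume that newline, so the match is never zero-length, and then put the newline back afterwards. The names msgs, stamp and parts are made up for illustration, and the input is a shortened, hypothetical variant of the example data with a newline in the middle of the first message.

```r
# Hypothetical input: two messages, the first one contains a newline mid-message.
msgs <- "05.10.17, 09:26 - Person One: How about\nsunday\n05.10.17, 09:27 - Person Two: ???\n"

# Timestamp pattern dd.mm.yy, hh:mm (same character classes as in the attempt above).
stamp <- "[0-3][0-9][.][0-9]{2}[.][0-9]{2}, [0-9]{2}:[0-9]{2}"

# Split on a newline that is directly followed by a timestamp.
# The newline itself is consumed, so the match is never zero-length.
pattern <- paste0("\n(?=", stamp, ")")
parts <- unlist(strsplit(msgs, pattern, perl = TRUE))

# Re-append the newline that the split removed; the last element keeps its own.
n <- length(parts)
if (n > 1) parts[-n] <- paste0(parts[-n], "\n")

parts
```

A newline in the middle of a message is left alone here because it is not followed by a timestamp. A message body that happens to contain a line starting with something that looks like a timestamp would still be split, so this is only a sketch under the assumptions stated above.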