I'm trying to split a character vector containing messages right in front of each date-time indicator.

I was thinking about using strsplit() with a regular expression and perl = TRUE.

Here's some example data:

TEST <- c("05.10.17, 09:26 - Person One: How about we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n05.10.17, 09:27 - Person Two: ???\n05.10.17, 09:28 - Person Two: You guys have history?\n05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n")

This is what I tried so far:

Cut <- unlist(strsplit(TEST,"(?=[0-3][0-9][.][0-9]{2}[.][0-9]{2}[,][ ][0-9]{2}:[0-9]{2})", perl = TRUE))
Cut

According to this website, the regex should split the string right in front of the date-time indicator. However, the result I get looks like this, with the first character of each message split off into its own element:

 [1] "0"                                                                                   
 [2] "5.10.17, 09:26 - Person One: How about we chill on sunday\n"                         
 [3] "0"                                                                                   
 [4] "5.10.17, 09:27 - Person One: I could bring some beer\n"                              
 [5] "0"                                                                                   
 [6] "5.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"  
 [7] "0"                                                                                   
 [8] "5.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
 [9] "0"                                                                                   
[10] "5.10.17, 09:27 - Person Two: ???\n"
[11] "0"                                                                                   
[12] "5.10.17, 09:28 - Person Two: You guys have history?\n"                               
[13] "0"                                                                                   
[14] "5.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"

This is what the result should look like:

 [1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"                                                                                   
 [2] "05.10.17, 09:27 - Person One: I could bring some beer\n"                         
 [3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"                                                                                   
 [4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
 [5] "05.10.17, 09:27 - Person Two: ???\n"                                                                                   
 [6] "05.10.17, 09:28 - Person Two: You guys have history?\n"  
 [7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"

Note: I can't split the data at the newline character because some of the messages contain one or more newlines in the middle of the message.

ikegami
Ju Ko

3 Answers

2

You just need to insert a split marker wherever a \n is followed by a date, and then split on that marker.

 strsplit(gsub("(.*?\\n)(\\d+[.]\\d+[.]\\d+)","\\1SPLITHERE\\2",TEST),"SPLITHERE")
[[1]]
[1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"                         
[2] "05.10.17, 09:27 - Person One: I could bring some beer\n"                              
[3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"  
[4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
[5] "05.10.17, 09:27 - Person Two: ???\n"                                                  
[6] "05.10.17, 09:28 - Person Two: You guys have history?\n"                               
[7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"

You can also use regmatches from base R:

 regmatches(TEST,gregexpr(".*?\\n",TEST))
[[1]]
[1] "05.10.17, 09:26 - Person One: How about we chill on sunday\n"                         
[2] "05.10.17, 09:27 - Person One: I could bring some beer\n"                              
[3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n"  
[4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n"                              
[5] "05.10.17, 09:27 - Person Two: ???\n"                                                  
[6] "05.10.17, 09:28 - Person Two: You guys have history?\n"                               
[7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
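
The regmatches pattern above ends a piece at every \n, so a message with a line break of its own would be cut in two. A minimal sketch of a variant that anchors on the timestamp instead is shown here. It reuses the date-time pattern from the question; the names stamp, pattern and msgs are only illustrative, and the sample is shortened. It assumes a timestamp never appears in the middle of a message.

```r
# Shortened sample; the first message contains a line break of its own.
TEST <- "05.10.17, 09:26 - Person One: How about\n we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n"

# Date-time pattern from the question (illustrative variable name).
stamp <- "[0-3][0-9][.][0-9]{2}[.][0-9]{2}, [0-9]{2}:[0-9]{2}"

# (?s) lets "." match "\n"; the lazy ".*?" stops right before the next
# timestamp or at the very end of the string (\z).
pattern <- paste0("(?s)", stamp, ".*?(?=", stamp, "|\\z)")

msgs <- regmatches(TEST, gregexpr(pattern, TEST, perl = TRUE))[[1]]
print(msgs)
```

Each element of msgs starts with a timestamp and keeps its trailing \n, so pasting the elements back together should reproduce TEST.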
Onyambu
  • This works well in the example, but could you explain what the regex does? It looks like it's based on the \n string and not the actual date-time string. I guess this would cause problems if there's a newline in the middle of a message? – Ju Ko Jan 27 '18 at 17:00
  • @JuKo Oh, I see what you mean. I have added a solution that is more convenient to use. – Onyambu Jan 28 '18 at 04:05
1

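You can add the whitespace character class \\s before your positive lookahead. The match is then no longer zero-length: the split consumes the newline in front of each date instead of peeling off the first character.

I have slightly changed your example to match your question more precisely (i.e., I added a \n inside one of the messages):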

> TEST <- c("05.10.17, 09:26 - Person One: How about\n we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards\n05.10.17, 09:27 - Person One: shit man, not LiNDA -.-\n05.10.17, 09:27 - Person Two: ???\n05.10.17, 09:28 - Person Two: You guys have history?\n05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n")
> unlist(strsplit(TEST,"\\s(?=[0-3][0-9][.][0-9]{2}[.][0-9]{2}[,][ ][0-9]{2}:[0-9]{2})", perl = TRUE))

## [1] "05.10.17, 09:26 - Person One: How about\n we chill on sunday"                         
## [2] "05.10.17, 09:27 - Person One: I could bring some beer"                                
## [3] "05.10.17, 09:27 - Person Two: Sounds good, we could go to Lindas Party afterwards"    
## [4] "05.10.17, 09:27 - Person One: shit man, not LiNDA -.-"                                
## [5] "05.10.17, 09:27 - Person Two: ???"                                                    
## [6] "05.10.17, 09:28 - Person Two: You guys have history?"                                 
## [7] "05.10.17, 09:28 - Person One: She killed my family and sold their ears as souvenirs\n"
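
For reference, here is a small sketch of why the lone lookahead misbehaves and of a variant that keeps the trailing \n on every message. It relies on how strsplit() handles zero-length matches: when the match sits at the very start of the remaining text, one character is split off. Putting a lookbehind for \n in front of the lookahead avoids a match at that position. The names stamp, pattern and Cut2 are only illustrative, and the sample is shortened.

```r
# Toy case: a lone lookahead makes strsplit() peel off single characters.
toy_lookahead <- strsplit("AbcDef", "(?=[A-Z])", perl = TRUE)[[1]]
# Adding a lookbehind splits cleanly between "c" and "D".
toy_both <- strsplit("AbcDef", "(?<=[a-z])(?=[A-Z])", perl = TRUE)[[1]]
print(toy_lookahead)
print(toy_both)

# Shortened sample; the first message contains a line break of its own.
TEST <- "05.10.17, 09:26 - Person One: How about\n we chill on sunday\n05.10.17, 09:27 - Person One: I could bring some beer\n"

# Date-time pattern from the question (illustrative variable name).
stamp <- "[0-3][0-9][.][0-9]{2}[.][0-9]{2}, [0-9]{2}:[0-9]{2}"

# Split only where a "\n" sits directly in front of a timestamp.
pattern <- paste0("(?<=\\n)(?=", stamp, ")")
Cut2 <- strsplit(TEST, pattern, perl = TRUE)[[1]]
print(Cut2)
```

Unlike the \\s version, nothing is consumed by the split, so every element keeps its trailing \n.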
Gilles San Martin
1
strsplit(TEST, '(?<=\\\n|^)(0)', perl = TRUE)[[1]][2:7]
Shenglin Chen