
I have a data frame in which whole sentences are, in some cases, split across multiple rows.

For example, head(mydataframe) returns:

#  1 Do you have any idea what
#  2  they were arguing about?
#  3          Do--Do you speak
#  4                  English?
#  5                     yeah.
#  6            No, I'm sorry.

Assuming a sentence can be terminated by either

"." or "?" or "!" or "..."

are there any R library functions capable of outputting the following:

#  1 Do you have any idea what they were arguing about?
#  2          Do--Do you speak English?
#  3                     yeah.
#  4            No, I'm sorry.
Scott
  • What function did you use to read in the data? What does the data source look like? – tchakravarty Nov 15 '15 at 11:31
  • I wrote a function to parse an .srt file into a dataframe. Everything from the srt was removed except for what you see above. – Scott Nov 15 '15 at 11:45

2 Answers


This should work for sentences ending with any of: . ... ? or !

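# Collapse all rows into one string
x <- paste0(foo$txt, collapse = " ")
# Zero-width split: preceded by ?, . or ! and followed by whitespace
trimws(unlist(strsplit(x, "(?<=[?.!])(?=\\s)", perl = TRUE)))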

Credit to @AvinashRaj for the pointers on the lookbehind.

Which gives:

#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"                         
#[3] "yeah..."                                           
#[4] "No, I'm sorry." 
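
To see why the pattern uses both assertions, here is a minimal sketch on a made-up string (the variable names are illustrative and not from the answer). The lookbehind (?<=[?.!]) requires a terminator just before the split point, and the lookahead (?=\\s) requires whitespace just after it, which is what should keep "..." in one piece.

```r
# Made-up input for illustration only
x <- "yeah... No, I'm sorry."

# Lookbehind only: every single "." is followed by a split point,
# so the ellipsis gets broken apart
lookbehind_only <- strsplit(x, "(?<=[?.!])", perl = TRUE)[[1]]

# Lookbehind plus lookahead: split only where whitespace follows
with_lookahead <- strsplit(x, "(?<=[?.!])(?=\\s)", perl = TRUE)[[1]]

print(lookbehind_only)
print(trimws(with_lookahead))
```

With the lookahead in place, the trimmed result is "yeah..." and "No, I'm sorry."; without it, the ellipsis ends up in separate pieces.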

Data

I modified the toy dataset to include a case where a string ends with ... (as requested by the OP).

foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."), 
                  stringsAsFactors = FALSE)
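
Since the question asks for output shaped like the original data frame, the same split can be wrapped in a small helper. This is only a sketch: join_sentences() is a hypothetical name, not a function from any package, and it assumes the fragments are already in reading order.

```r
# Hypothetical helper (not part of the original answer): rebuild a data frame
# with one sentence per row from a character vector of fragments.
join_sentences <- function(fragments) {
  # Collapse all fragments into a single string
  x <- paste(fragments, collapse = " ")
  # Split after ., ? or ! when whitespace follows, then trim each piece
  sentences <- trimws(strsplit(x, "(?<=[?.!])(?=\\s)", perl = TRUE)[[1]])
  data.frame(num = seq_along(sentences), txt = sentences,
             stringsAsFactors = FALSE)
}

# Same toy data as above
foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."),
                  stringsAsFactors = FALSE)

result <- join_sentences(foo$txt)
print(result)
```

The result has four rows, with num renumbered from 1 and the ellipsis sentence kept intact in row 3.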
Steven Beaupré

Here is what I got; I am sure there are better ways to do this. I used base functions on a sample data frame called foo. First, I created a single string from all the text in txt. toString() separates the elements with commas, so I removed them in the first gsub(). Then I took care of white space (more than 2 spaces) in the second gsub(). Next, I split the string at the delimiters you specified; thanks to a post by Tyler Rinker, I managed to keep the delimiters in strsplit(). The final job was to remove the white space at the start of each sentence and then unlist the list.

EDIT: Steven Beaupré revised my code. That is the way to go!

foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah.", "No, I'm sorry."), 
                  stringsAsFactors = FALSE)

library(magrittr)

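# Join all rows into one string; toString() separates elements with ", "
toString(foo$txt) %>%
  # Drop the commas (note: this also removes commas that belong to the text)
  gsub(pattern = ",", replacement = "", x = .) %>%
  # Split after each terminator; the lookbehind keeps the delimiter
  strsplit(x = ., split = "(?<=[?.!])", perl = TRUE) %>%
  # Remove the single leading space left on each sentence
  lapply(., function(x) gsub(pattern = "^ ", replacement = "", x = x)) %>%
  unlist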

#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"                         
#[3] "yeah."                                             
#[4] "No I'm sorry." 
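
Note that the last line of the output has lost its comma ("No I'm sorry."), because the gsub() on "," cannot tell the separators added by toString() from commas in the text. A minimal base-R sketch that avoids this, assuming the rows can simply be joined with a single space instead (the names joined, pieces and sentences are illustrative):

```r
# Same toy data as above
foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah.", "No, I'm sorry."),
                  stringsAsFactors = FALSE)

# Join with a space so no separator commas are introduced
joined <- paste(foo$txt, collapse = " ")

# Split after each terminator; the lookbehind keeps the delimiter
pieces <- strsplit(joined, "(?<=[?.!])", perl = TRUE)[[1]]

# Remove the single leading space left on each piece
sentences <- sub("^ ", "", pieces)
print(sentences)
```

Here the fourth sentence comes back as "No, I'm sorry." with its comma preserved.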
jazzurro