Combining fragmented sentences in an R dataframe

Question

I have a dataframe in which whole sentences are, in some cases, split across multiple rows.

For example, head(mydataframe) returns

#  1 Do you have any idea what
#  2  they were arguing about?
#  3          Do--Do you speak
#  4                  English?
#  5                     yeah.
#  6            No, I'm sorry.

Assuming a sentence can be terminated by either

"." or "?" or "!" or "..."

are there any R library functions capable of outputting the following:

#  1 Do you have any idea what they were arguing about?
#  2          Do--Do you speak English?
#  3                     yeah.
#  4            No, I'm sorry.

What function did you use to read in the data? What does the data source look like? — tchakravarty, Nov 15 '15 at 11:31
I wrote a function to parse an .srt file into a dataframe. Everything from the srt was removed except for what you see above. — Scott, Nov 15 '15 at 11:45

Steven Beaupré · Accepted Answer · score 4 · 2015-11-15T13:38:54.440

This should work for all sentences ending with "." or "..." or "?" or "!"

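# Join all fragments into one string, separated by single spaces
x <- paste0(foo$txt, collapse = " ")
# Split after ".", "?" or "!" only where whitespace follows, so "..." stays whole
trimws(unlist(strsplit(x, "(?<=[?.!])(?=\\s)", perl = TRUE)))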

Credit to @AvinashRaj for the pointers on the lookbehind.

Which gives:

#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"                         
#[3] "yeah..."                                           
#[4] "No, I'm sorry."
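
To see what the lookahead contributes, here is a minimal sketch that compares the split pattern with a lookbehind-only variant on a short string. The names `x`, `with_lookahead` and `lookbehind_only` are made up for this illustration, and the character class holds only the three terminators (a `|` inside `[...]` would match a literal pipe).

```r
# Toy string: fragments already joined with single spaces.
x <- "Do--Do you speak English? yeah... No, I'm sorry."

# Split after a terminator only when whitespace follows,
# so the dots of "..." stay together.
with_lookahead <- trimws(unlist(strsplit(x, "(?<=[?.!])(?=\\s)", perl = TRUE)))

# Lookbehind only: a split point follows every single terminator,
# including each dot inside "...".
lookbehind_only <- trimws(unlist(strsplit(x, "(?<=[?.!])", perl = TRUE)))

print(with_lookahead)
print(lookbehind_only)
```

The lookbehind-only variant should return more pieces, because "yeah..." is broken up at each dot.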

Data

I modified the toy dataset to include a case where a string ends with ... (as requested by the OP).

foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."), 
                  stringsAsFactors = FALSE)
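
Since the question asks for a dataframe rather than a character vector, the same idea can be wrapped in a small function. This is only a sketch: `combine_fragments` is a hypothetical helper name, and trimming each fragment before joining is my assumption to guard against stray leading or trailing spaces in the input.

```r
# Hypothetical helper: collapse sentence fragments into one row per sentence.
combine_fragments <- function(txt) {
  # Trim each fragment (assumption), then join with single spaces.
  joined <- paste(trimws(txt), collapse = " ")
  # Split after ".", "?" or "!" only where whitespace follows.
  sentences <- trimws(unlist(strsplit(joined, "(?<=[?.!])(?=\\s)", perl = TRUE)))
  # Renumber the rows to match the combined sentences.
  data.frame(num = seq_along(sentences), txt = sentences,
             stringsAsFactors = FALSE)
}

foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."),
                  stringsAsFactors = FALSE)

result <- combine_fragments(foo$txt)
print(result)
```

Text after the last terminator (an unfinished sentence at the end of the file, for example) is kept as its own final row rather than dropped.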

edited Nov 15 '15 at 13:38

answered Nov 15 '15 at 12:34

Steven Beaupré

Concise answer but no space between 'what' and 'they' on first line. – Scott Nov 15 '15 at 13:10
@ScottHorvath It is up to you, but I think Steve's answer is concise and deserves more. – jazzurro Nov 15 '15 at 13:18
@StevenBeaupré My apology for the typo. – jazzurro Nov 15 '15 at 13:41
@ScottHorvath Fixed the "space" issue and edited to handle the case where a string could end with `...` – Steven Beaupré Nov 15 '15 at 13:47

score 3 · Answer 2 · edited May 23 '17 at 11:59

Here is what I got using base functions; I am sure there are better ways to do this. I created a sample data frame called foo. First, I built a single string from all the texts in txt with toString(). toString() separates the elements with ", ", so I removed the commas in the first gsub(). Then, I took care of white space (more than 2 spaces) in the second gsub(). Next, I split the string at the delimiters you specified. Thanks to a post by Tyler Rinker, I managed to keep the delimiters in the output of strsplit(). The final job was to remove the white space at the start of each sentence and then unlist the list.

EDIT Steven Beaupré revised my code. That is the way to go!

foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah.", "No, I'm sorry."), 
                  stringsAsFactors = FALSE)

library(magrittr)

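# toString() joins the elements with ", ".
toString(foo$txt) %>%
  # Remove those separators. Note: this also removes commas that belong
  # to the text itself (see "No I'm sorry." in the output below).
  gsub(pattern = ",", replacement = "", x = .) %>%
  # Split after each terminator; the lookbehind keeps it in the output.
  strsplit(x = ., split = "(?<=[?.!])", perl = TRUE) %>%
  # Drop the leading space left at the start of each sentence.
  lapply(., function(x) gsub(pattern = "^ ", replacement = "", x = x)) %>%
  unlist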

#[1] "Do you have any idea what they were arguing about?"
#[2] "Do--Do you speak English?"                         
#[3] "yeah."                                             
#[4] "No I'm sorry."
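
The last line of the output has lost its comma. A minimal sketch of why, using a shortened copy of the data (the names `txt`, `via_toString` and `via_paste` are made up for this illustration): deleting every comma after toString() cannot tell separators apart from punctuation, while joining with paste() and a space adds nothing that has to be removed afterwards.

```r
txt <- c("Do--Do you speak", "English?", "No, I'm sorry.")

# toString() joins with ", ", so deleting every comma also deletes
# the comma that belongs to the last sentence.
via_toString <- gsub(",", "", toString(txt))

# paste() with collapse = " " adds only spaces, so nothing needs removing.
via_paste <- paste(txt, collapse = " ")

print(via_toString)
print(via_paste)
```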

answered Nov 15 '15 at 12:16

jazzurro

Or maybe: `x <- paste0(foo$txt, collapse = ""); unlist(strsplit(x, "(?<=[?.!|])", perl=TRUE))` – Steven Beaupré Nov 15 '15 at 12:35
@StevenBeaupré Man, I think you want to keep that answer up. Yours is much better. You do not have to deal with all space issues in your solution. I was about to upvote it. – jazzurro Nov 15 '15 at 12:36
Ok will undelete. But it's the same idea, just using `paste0` instead of `toString` to simplify the space issue. – Steven Beaupré Nov 15 '15 at 12:38
@StevenBeaupré Go for it! – jazzurro Nov 15 '15 at 12:40
Why did you have a leading space before "English" in the dataset ? – Steven Beaupré Nov 15 '15 at 13:33
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/95171/discussion-between-steven-beaupre-and-jazzurro). – Steven Beaupré Nov 15 '15 at 13:44
