I have a collection of text messages, scraped from a forum, in a data frame. Here's a reproducible example:

example.df <- data.frame(author=c("Mikey", "Donald", "Mikey", "Daisy", "Minnie", "Daisy"),
                         message=c("Hello World! Mikey Mouse", 
                                   "Quack Quack! Donald Duck", 
                                   "I was born in 1928. Mikey Mouse", 
                                   "Quack Quack! Daisy Duck", 
                                   "The quick fox jump over Minnie Mouse", 
                                   "Quack Quack! Daisy Duck"))

My idea is to find the longest common suffix shared by all messages from the same author, for every author who has written more than one message. For all the others, well, I'll find a regex approach that degrades gracefully.

I found the Bioconductor package Rlibstree, which looks promising thanks to its getLongestCommonSubstring function, but I don't know how to group the function to all the messages from the same author.

halfer
Gabriele B
  • I'm not familiar with the Bioconductor package, but as a pointer: perhaps you can use packages like `data.table` or `dplyr` to apply the substring function by author group very efficiently. Another possibility might be to `split` the data into a list of data frames (one per author), then apply the substring function using `lapply`, and afterwards `rbind` the list back into a single data.frame. – talat Oct 22 '14 at 11:35
  • Rlibstree is not actually a Bioconductor package; it's in the Bioconductor extra repository, meaning it's needed to build other Bioconductor packages. I think its real home is on github: https://github.com/omegahat/Rlibstree. I'm removing the Bioconductor tag from this question as it's not really relevant. – Dan Tenenbaum Oct 22 '14 at 20:46
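
The `split` / `lapply` / `rbind` route suggested in the first comment could be sketched roughly as follows, in base R. The per-author step here only records how many messages the author wrote; `n_messages` is a placeholder column name (not from the question), standing in for whatever substring function ends up being applied.

```r
example.df <- data.frame(
    author  = c("Mikey", "Donald", "Mikey", "Daisy", "Minnie", "Daisy"),
    message = c("Hello World! Mikey Mouse",
                "Quack Quack! Donald Duck",
                "I was born in 1928. Mikey Mouse",
                "Quack Quack! Daisy Duck",
                "The quick fox jump over Minnie Mouse",
                "Quack Quack! Daisy Duck"),
    stringsAsFactors = FALSE)

# one data frame per author
per_author <- split(example.df, example.df$author)

# apply a per-author step; counting messages is only a stand-in
# (assumption) for the real substring function
processed <- lapply(per_author, function(d) {
    d$n_messages <- nrow(d)
    d
})

# bind the pieces back into a single data frame
# (rows come back grouped by author, not in the original order)
result <- do.call(rbind, processed)
```

Note that the row order changes after the `rbind`, so keep an id column if the original order matters.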

3 Answers

I think I'd convert the messages to a list in the format below, then use the stringdist package to find sentences that recur across an author's messages and remove any whose similarity exceeds a chosen threshold. outer may be of use here as well:

## load packages in this order
library(stringi)
library(magrittr)

example.df[["message"]] %>% 
    stringi::stri_split_regex(., "(?<=[.?!]{1,5})\\s+") %>%
    split(example.df[["author"]])

## $Daisy
## $Daisy[[1]]
## [1] "Quack Quack!" "Daisy Duck"  
## 
## $Daisy[[2]]
## [1] "Quack Quack!" "Daisy Duck"  
## 
## 
## $Donald
## $Donald[[1]]
## [1] "Quack Quack!" "Donald Duck" 
## 
## 
## $Mikey
## $Mikey[[1]]
## [1] "Hello World!" "Mikey Mouse" 
## 
## $Mikey[[2]]
## [1] "I was born in 1928." "Mikey Mouse"        
## 
## 
## $Minnie
## $Minnie[[1]]
## [1] "The quick fox jump over Minnie Mouse"
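
The similarity step is not spelled out above, so here is one rough way it might look. To stay dependency-free, this sketch swaps in base R's `adist` (Levenshtein distance) for stringdist; the helpers `sim` and `drop_repeated` and the 0.9 threshold are hypothetical names and values chosen for illustration. The sentences for one author are hard-coded in the shape the split above produces.

```r
# sentences for one author, in the list format shown above (hard-coded here)
mikey <- list(c("Hello World!", "Mikey Mouse"),
              c("I was born in 1928.", "Mikey Mouse"))

# hypothetical helper: similarity in [0, 1] derived from Levenshtein distance;
# adist() is a base-R stand-in for the stringdist package
sim <- function(a, b) {
    d <- adist(a, b)
    1 - d / outer(nchar(a), nchar(b), pmax)
}

# hypothetical helper: drop every sentence that has a near-duplicate
# in each of the author's other messages
drop_repeated <- function(msgs, threshold = 0.9) {
    lapply(seq_along(msgs), function(i) {
        others <- msgs[-i]
        if (length(others) == 0) return(msgs[[i]])  # single message: keep as is
        repeated <- vapply(msgs[[i]], function(s) {
            all(vapply(others, function(o) any(sim(s, o) >= threshold),
                       logical(1)))
        }, logical(1))
        msgs[[i]][!repeated]
    })
}

cleaned <- drop_repeated(mikey)
```

For the two Mikey messages this leaves only the sentences that are not shared, i.e. the "Mikey Mouse" sign-off is dropped from both. A threshold below 1 lets slightly varying signatures match as well, at the risk of removing genuine content.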
Tyler Rinker

"I don't know how to group the function to all the messages from the same author."

Perhaps tapply is what you are looking for.

> tapply(as.character(example.df$message), example.df$author, function(x) x)
$Daisy
[1] "Quack Quack! Daisy Duck" "Quack Quack! Daisy Duck"

$Donald
[1] "Quack Quack! Donald Duck"

$Mikey
[1] "Hello World! Mikey Mouse"        "I was born in 1928. Mikey Mouse"

$Minnie
[1] "The quick fox jump over Minnie Mouse"

You can use your own function in place of function(x) x, of course.
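
As an illustration of plugging in your own function, here is a minimal sketch that returns each author's longest common suffix. `common_suffix` is a hypothetical helper written for this example, not something from a package, and the data frame is rebuilt with `stringsAsFactors = FALSE` so that the messages are plain character strings.

```r
example.df <- data.frame(
    author  = c("Mikey", "Donald", "Mikey", "Daisy", "Minnie", "Daisy"),
    message = c("Hello World! Mikey Mouse",
                "Quack Quack! Donald Duck",
                "I was born in 1928. Mikey Mouse",
                "Quack Quack! Daisy Duck",
                "The quick fox jump over Minnie Mouse",
                "Quack Quack! Daisy Duck"),
    stringsAsFactors = FALSE)

# hypothetical helper: longest common suffix of a character vector
common_suffix <- function(x) {
    if (length(x) < 2) return("")            # nothing to compare against
    rev_chars <- lapply(strsplit(x, ""), rev)
    n <- min(lengths(rev_chars))
    if (n == 0) return("")                   # an empty message shares no suffix
    # n x length(x) matrix: row i holds the i-th character from the end
    m <- do.call(cbind, lapply(rev_chars, function(ch) ch[seq_len(n)]))
    same <- apply(m, 1, function(row) all(row == row[1]))
    k <- if (all(same)) n else which(!same)[1] - 1
    substring(x[1], nchar(x[1]) - k + 1)
}

suffixes <- tapply(example.df$message, example.df$author, common_suffix)
```

Authors with a single message get an empty string, so nothing would be stripped from their messages.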

Armali

Here is an implementation that uses no additional libraries.

example.df <- data.frame(author=c("Mikey", "Donald", "Mikey",
                                  "Daisy", "Minnie", "Daisy"),
                         message=c("Hello World! Mikey Mouse", 
                                   "Quack Quack! Donald Duck", 
                                   "I was born in 1928. Mikey Mouse", 
                                   "Quack Quack! Daisy Duck", 
                                   "The quick fox jump over Minnie Mouse", 
                                   "Quack Quack! Daisy Duck"))

signlen = function(am)  # determine signature length of an author's messages
{
    if (length(am) <= 1) return(0)  # return if not more than 1 message

    # turn the messages into reversed vectors of single characters
    # in order to conveniently access the suffixes from index 1 on
    am = lapply(strsplit(as.character(am), ''), rev)
    # find the longest common suffix in the messages
    longest_common = .Machine$integer.max
    for (m in 2:length(am))
    {
        i = 1
        max_length = min(length(am[[m]]), length(am[[m-1]]), longest_common)
        while (i <= max_length && am[[m]][i] == am[[m-1]][i]) i = i+1
        longest_common = i-1
        if (longest_common == 0) return(0)  # shortcut: need not look further
    }
    return(longest_common)
}

# determine signature length of every author's messages
signature_length = tapply(example.df$message, example.df$author, signlen)
#> signature_length
# Daisy Donald  Mikey Minnie 
#    23      0     12      0 

# determine resulting length "to" of messages with signatures removed
to = nchar(as.character(example.df$message))-signature_length[example.df$author]
#> to
# Mikey Donald  Mikey  Daisy Minnie  Daisy 
#    12     24     19      0     36      0 

# remove the signatures by replacing messages with resulting substring
example.df$message = substr(example.df$message, 1, to)
#> example.df
#  author                              message
#1  Mikey                         Hello World!
#2 Donald             Quack Quack! Donald Duck
#3  Mikey                  I was born in 1928.
#4  Daisy                                     
#5 Minnie The quick fox jump over Minnie Mouse
#6  Daisy                                     
Armali