I'm working on speech in conversational speaking turns and want to extract words that are repeated across turns. The task I'm grappling with is extracting words that are repeated inexactly.
Data:
X <- data.frame(
  speaker = c("A", "B", "A", "B"),
  speech = c("i'm gonna take a look you okay with that",
             "sure looks good we can take a look you go first",
             "okay last time I looked was different i think that is it yeah",
             "yes you're right i think that's it"),
  stringsAsFactors = FALSE
)
I have a `for` loop that successfully extracts exact repetitions:
library(stringr)
# initialize containers (one pattern per turn, one list element of matches per turn):
pattern1 <- character()
extracted1 <- list()
# run `for` loop:
for (i in 2:nrow(X)) {
  # define each `speech` element as a pattern for the next `speech` element:
  pattern1[i-1] <- paste0("\\b(", paste0(unlist(str_split(X$speech[i-1], " ")), collapse = "|"), ")\\b")
  # extract all matched words:
  extracted1[i] <- str_extract_all(X$speech[i], pattern1[i-1])
}
# result:
extracted1
[[1]]
NULL
[[2]]
[1] "take" "a" "look" "you"
[[3]]
character(0)
[[4]]
[1] "i" "think" "that" "it"
However, I also want to extract inexact repetitions. For example, `looks` in row #2 is an inexact repetition of `look` in row #1, `looked` in row #3 fuzzily repeats `looks` in row #2, and `yes` in row #4 is an approximate match of `yeah` in row #3.
I've recently come across `agrep`, which is used for approximate matching, but I don't know how to use it here or whether it's the right approach at all. Any help is greatly appreciated.
Note that the actual data comprises thousands of speaking turns with highly unpredictable content, so it's not possible to define a list of all possible variants beforehand.
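One possible direction, offered as a sketch under assumptions rather than a settled solution: `agrep` looks for approximate matches of a pattern anywhere inside a string, so a short pattern can match inside many unrelated strings. The sketch below instead compares whole words with `utils::adist`, which computes the generalized Levenshtein distance. The helper name `fuzzy_repeats` and the threshold `max_rel_dist = 0.34` (edit distance divided by the length of the longer word) are illustrative choices, not tuned values.

```r
# Example data from the question.
X <- data.frame(
  speaker = c("A", "B", "A", "B"),
  speech = c("i'm gonna take a look you okay with that",
             "sure looks good we can take a look you go first",
             "okay last time I looked was different i think that is it yeah",
             "yes you're right i think that's it"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the words in `curr` that have no exact match
# in `prev` but are within a relative edit distance of some word in `prev`.
fuzzy_repeats <- function(prev, curr, max_rel_dist = 0.34) {
  prev_words <- unique(strsplit(prev, " ", fixed = TRUE)[[1]])
  curr_words <- unique(strsplit(curr, " ", fixed = TRUE)[[1]])
  # Edit-distance matrix: rows are current words, columns are previous words.
  d <- utils::adist(curr_words, prev_words)
  # Length of the longer word in each pair, used to normalise the distance.
  longest <- outer(nchar(curr_words), nchar(prev_words), pmax)
  inexact <- d > 0 & d / longest <= max_rel_dist
  has_exact <- rowSums(d == 0) > 0
  curr_words[rowSums(inexact) > 0 & !has_exact]
}

# Compare each turn with the turn before it; element 1 stays NULL.
fuzzy <- vector("list", nrow(X))
for (i in 2:nrow(X)) {
  fuzzy[[i]] <- fuzzy_repeats(X$speech[i - 1], X$speech[i])
}
fuzzy
```

On the example data this flags `looks` in row #2 and `looked` in row #3. The `yes`/`yeah` pair is not caught: the two words are 2 edits apart over 4 characters, the same as the unrelated pair `time`/`take`, so edit distance alone may not separate them, and stemming or a phonetic measure might be needed for such cases.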