Count+Identify common words in two string vectors [R]

Question

How can I write an R function that takes two string vectors and returns both the number of common words and the common words themselves, comparing element 1 of stringvec1 to element 1 of stringvec2, element 2 of stringvec1 to element 2 of stringvec2, and so on?

Suppose I have these data:

#string vector 1
strvec1 <- c("Griffin Rahea Petersen Deana Franks Morgan","Story Keisha","Douglas Landon Lark","Kinsman Megan Thrall Michael Michels Breann","Gutierrez Mccoy Tyler Westbrook Grayson Swank Shirley Didas Moriah")

#string vector 2
strvec2 <- c("Griffin Morgan Rose Manuel","Van De Grift Sarah Sell William","Mark Landon Lark","Beerman Carlee Megan Thrall Michels","Mcmillan Tyler Jonathan Westbrook Grayson Didas Lloyd Connor")

Ideally I'd have a function that would return number of common words AND what the common words are:

#Non working sample of how functions would ideally work
desiredfunction_numwords(strvec1,strvec2)
[1] 2 0 2 3 4

desiredfunction_matchwords(strvec1,strvec2)
[1] "Griffin Morgan" "" "Landon Lark" "Megan Thrall Michels" "Tyler Westbrook Grayson Didas"

score 6 · Accepted Answer · edited Dec 22 '20 at 13:25

You can split each string into words and then take the intersection of the two word sets, element by element.

In base R:

numwords <- function(str1, str2) {
  mapply(function(x, y) length(intersect(x, y)), 
         strsplit(str1, ' '), strsplit(str2, ' '))
}

matchwords <- function(str1, str2) {
  mapply(function(x, y) paste0(intersect(x, y),collapse = " "), 
         strsplit(str1, ' '), strsplit(str2, ' '))
}

numwords(strvec1, strvec2)
#[1] 2 0 2 3 4

matchwords(strvec1, strvec2)
#[1] "Griffin Morgan"          ""                "Landon Lark"                  
#[4] "Megan Thrall Michels"          "Tyler Westbrook Grayson Didas"
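If both results are needed together, the two functions above can be folded into one pass. This is only a sketch: the name common_words and the data frame layout are my own choices for illustration, and it assumes both vectors have the same length and that words are separated by single spaces.

```r
# Hypothetical helper: returns the count and the shared words side by side.
common_words <- function(str1, str2) {
  stopifnot(length(str1) == length(str2))
  # Split each string into words, then intersect the word sets pairwise.
  shared <- Map(intersect,
                strsplit(str1, " ", fixed = TRUE),
                strsplit(str2, " ", fixed = TRUE))
  data.frame(
    n     = lengths(shared),
    words = vapply(shared, paste, character(1), collapse = " "),
    stringsAsFactors = FALSE
  )
}

# First three elements of the question's data
a <- c("Griffin Rahea Petersen Deana Franks Morgan", "Story Keisha", "Douglas Landon Lark")
b <- c("Griffin Morgan Rose Manuel", "Van De Grift Sarah Sell William", "Mark Landon Lark")

res <- common_words(a, b)
res$n
#[1] 2 0 2
res$words
#[1] "Griffin Morgan" ""               "Landon Lark"
```

Because intersect works on sets, a word that appears twice in one string is still counted once.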

score 0 · Answer 2 · answered Dec 22 '20 at 14:19

You can use strvec1 as a regex pattern by splitting it into separate words with strsplit and pasting the words back together with the alternation operator |:

pattern <- paste0(unlist(strsplit(strvec1, " ")), collapse = "|")

You can use this pattern with str_count and str_extract_all:

library(stringr) 
# counts:
str_count(strvec2, pattern)
[1] 2 0 2 3 4

# matches:
str_extract_all(strvec2, pattern)
[[1]]
[1] "Griffin" "Morgan" 

[[2]]
character(0)

[[3]]
[1] "Landon" "Lark"  

[[4]]
[1] "Megan"   "Thrall"  "Michels"

[[5]]
[1] "Tyler"     "Westbrook" "Grayson"   "Didas"
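One caveat with the single pooled pattern: it contains the words of every element of strvec1, so each element of strvec2 is compared against all of strvec1 rather than only its counterpart, and without word boundaries a word can also match inside a longer word. The results happen to agree with the desired output for the sample data. A possible per-element variant, written in base R so it runs without extra packages, is sketched below. The name match_whole_words is made up for illustration, and the sketch assumes equal-length inputs, non-empty strings, and words that contain no regex metacharacters.

```r
# Hypothetical helper: one pattern per element of str1, matched against the
# corresponding element of str2, whole words only.
match_whole_words <- function(str1, str2) {
  stopifnot(length(str1) == length(str2))
  # Build "\\b(?:word1|word2|...)\\b" for each element of str1.
  patterns <- vapply(
    strsplit(str1, " ", fixed = TRUE),
    function(words) paste0("\\b(?:", paste(words, collapse = "|"), ")\\b"),
    character(1)
  )
  # Extract all matches of pattern i from string i.
  mapply(
    function(s, p) regmatches(s, gregexpr(p, s, perl = TRUE))[[1]],
    str2, patterns, SIMPLIFY = FALSE, USE.NAMES = FALSE
  )
}

# First three elements of the question's data
a <- c("Griffin Rahea Petersen Deana Franks Morgan", "Story Keisha", "Douglas Landon Lark")
b <- c("Griffin Morgan Rose Manuel", "Van De Grift Sarah Sell William", "Mark Landon Lark")

m <- match_whole_words(a, b)
lengths(m)
#[1] 2 0 2

# Made-up input showing the word-boundary effect: "Lark" does not match "Larkin"
match_whole_words("Douglas Landon Lark", "Mark Landon Larkin")[[1]]
#[1] "Landon"
```

Unlike intersect, this counts a word once for every time it occurs in the second string.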
