
I have 5 vectors of strings, each with a different number of elements. Many elements are common to several of the vectors.

Example:

v1 <- c("a","x","y","z")
v2 <- c("b","g","m","r","s","x","z")
v3 <- c("a","m","x","y","z","b","r","g")
v4 <- c("d","h","a","g","s","x")
v5 <- c("a","b","m","x","y","z")

I want to calculate the percentage of matches between every pair of vectors, based on the number of elements they share. The order of the elements should not matter, so each element of one vector has to be checked against every element of the other vector. For example, v1 and v5 share 4 elements and have 4 + 6 = 10 elements in total, so I would say that they match by (8/10)*100 = 80%. I want all pairs of vectors whose percentage is higher than 50%.

Abhi
  • Shouldn't v1,v5 be 40% since the matches are a,x,y,z (4) out of a,x,y,z and a,b,m,x,y,z (4 + 6 = 10) ? – DarrenRhodes Apr 15 '16 at 10:20
  • I am not much worried about the metrics used to calculate the matches. I used (4*2)/10 as I thought it could give me a reasonable value. – Abhi Apr 15 '16 at 10:26
  • Yes v2 and v3 have 93.3% match. Sorry I missed that. – Abhi Apr 15 '16 at 10:27
  • v2 and v3 give (6*2)/15 matches. This is equal to 80% if you calculate it using the same formula. – Abhi Apr 15 '16 at 10:34
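
The 40% and 80% figures in the comments above come from two different normalizations of the same count. As a worked sketch (the set notation $A$, $B$ is introduced here for illustration and is not used in the thread), the asker's formula doubles the shared count, which matches the Sørensen-Dice coefficient:

```latex
\[
\text{match}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} \times 100\%
\]
\[
\text{match}(v_1, v_5) = \frac{2 \cdot 4}{4 + 6} \times 100\% = 80\%,
\qquad
\text{match}(v_2, v_3) = \frac{2 \cdot 6}{7 + 8} \times 100\% = 80\%
\]
% Without the factor 2, v1 and v5 give 4 / (4 + 6) = 40%, the value suggested in the first comment.
```

With the factor of 2, two identical vectors score 100%; without it they would score only 50%.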

2 Answers

3

An easy implementation would be to compare all combinations of two vectors. You can then use intersect to find the number of common values.

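library(caTools)  # provides combs()

# Every unordered pair of vector names, one pair per row
comb <- combs(c("v1","v2","v3","v4","v5"), 2)

for (i in seq_len(nrow(comb))) {
    # Look up each vector by its name
    a <- eval(parse(text = comb[i, 1]))
    b <- eval(parse(text = comb[i, 2]))
    # Fraction between 0 and 1; multiply by 100 for a percentage
    prct <- 2 * length(intersect(a, b)) / (length(a) + length(b))
    cat("\nMatching between", comb[i, 1], "and", comb[i, 2], "is", prct)
}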

(Here prct is calculated as I think you've described in your example with v1 and v5)

Note that you can also do this using two nested for-loops, but I find combs easier to use to avoid duplicate combinations.
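
As a rough sketch of the same idea without an extra package, the vectors could be kept in a named list and paired with base R's `combn`, which also makes it easy to keep only the pairs above 50% as the question asks. The names `vecs`, `match_pct`, `result` and `above_50` are made up for this illustration; the metric is the one from the question.

```r
# Hypothetical container: the five example vectors gathered in a named list
vecs <- list(
  v1 = c("a", "x", "y", "z"),
  v2 = c("b", "g", "m", "r", "s", "x", "z"),
  v3 = c("a", "m", "x", "y", "z", "b", "r", "g"),
  v4 = c("d", "h", "a", "g", "s", "x"),
  v5 = c("a", "b", "m", "x", "y", "z")
)

# Match percentage as used in the question:
# 2 * (number of shared elements) / (total number of elements) * 100
match_pct <- function(a, b) {
  100 * 2 * length(intersect(a, b)) / (length(a) + length(b))
}

# combn() returns a 2 x n character matrix, one unordered pair per column
pairs <- combn(names(vecs), 2)

result <- data.frame(
  first  = pairs[1, ],
  second = pairs[2, ],
  pct    = apply(pairs, 2, function(p) match_pct(vecs[[p[1]]], vecs[[p[2]]]))
)

# Keep only the pairs that match by more than 50%
above_50 <- result[result$pct > 50, ]
print(above_50)
```

Looking vectors up in a list by name avoids `eval(parse(...))`, and the data frame can be sorted or filtered further if a different threshold is needed.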

radiumhead
  • Can you please help me with the modifications needed so that this code can be used with vectors that contain groups of words, such as "the iron metal", instead of single characters such as "a"? Again, we want to look for complete matches of "the iron metal". I am not very familiar with working with strings, and I am not able to figure out the role of parse and eval here. – Abhi Apr 16 '16 at 18:09
  • Unless I misunderstand your example, I think the code should work just as well with long strings as with single letters. The role of `parse` and `eval` is not really important in the string comparison process here—the truly important function is `intersect(a, b)`, which will work with a vector of any type. – radiumhead Apr 17 '16 at 19:15
  • If you mean that you need to separate "the iron metal" into a vector containing "the", "iron" and "metal", then you can use `strsplit("the iron metal", split = " ")`. – radiumhead Apr 17 '16 at 19:18
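
To illustrate the two readings discussed in these comments, here is a small sketch with made-up phrase vectors (`a`, `b` and the other names below are hypothetical, not data from the thread). `intersect` compares whole strings, so multi-word phrases match only when they are identical; `strsplit` is needed only if matching should happen word by word.

```r
# Hypothetical example vectors containing multi-word strings
a <- c("the iron metal", "copper wire", "steel beam")
b <- c("steel beam", "the iron metal", "gold leaf")

# Whole-phrase matching: each element is compared as one complete string
phrase_matches <- intersect(a, b)
phrase_pct <- 100 * 2 * length(phrase_matches) / (length(a) + length(b))

# Word-level matching: split every phrase on spaces first, then compare words
words_a <- unique(unlist(strsplit(a, split = " ", fixed = TRUE)))
words_b <- unique(unlist(strsplit(b, split = " ", fixed = TRUE)))
word_matches <- intersect(words_a, words_b)

print(phrase_matches)
print(word_matches)
```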
0

I used the info here and HERE to write the function below. Just pass in your data frame and the two column numbers.

    # x = data frame; y = column number of string 1; z = column number of string 2
    string_matcher <- function(x, y, z) {
      # Split each string into its individual characters
      char.x <- strsplit(as.character(x[, y]), "")
      char.y <- strsplit(as.character(x[, z]), "")

      # Row-wise overlap: 2 * distinct shared characters / total characters in both strings
      stored_vector <- as.matrix(sapply(seq_len(nrow(x)), function(i) {
        2 * length(intersect(char.x[[i]], char.y[[i]])) /
          (length(char.x[[i]]) + length(char.y[[i]]))
      }))

      return(stored_vector)
    }
Alex Bădoi