0

I have a dataframe composed of groups with corresponding animals as a string:

data = data.frame(group = c(1,2,3,4), animal = c("cat, dog, horse, mouse", "cat, dog, horse", "cat, dog,", "cat, dog, frog, cow"))

I would like to return groups where the longest string match. In this example, group 1 and 2 would match, as 3 out of 4 animals are the same. Also group 2 and 3 would match as 2 out of 3 animals match. But, no groups with match with group 4 because only 2 out of 4 animals match.

I'd like to return a dataframe like this showing the matching groups:

group_a group_b
   1       2
   2       3

I'm not sure if this is possible. I've looked into how to match partial strings but struggling to find similar examples to adapt. Any ideas?

Thank you.

LHordley
  • 95
  • 5
  • Can you explain the grouping on how group_a and group_b are formed in a different way? It looks like possible, but needs some clarity. – Mohanasundaram Apr 27 '20 at 13:22
  • Not entirely sure what you mean. But the idea is that when two groups have the longest matching string, it would return the numbers of the two matching groups. Does that help? – LHordley Apr 27 '20 at 13:24
  • Understood! Do you want to create as many matching groups as possible? And in group 4, cat, dog are matching with three other groups why can't one of those be selected? Do you want to have matching for more half of the number of animals? – Mohanasundaram Apr 27 '20 at 13:27
  • Yes as many matching groups as possible, but ideally I'd only like them to match when is one animal missing. I.e. 3 out of 4, or 8 out of 9 (my dataset goes up to 9 animals). – LHordley Apr 27 '20 at 13:32
  • Do you want to actually match substrings? Or will your use case be essentially about matching words? –  Apr 27 '20 at 13:50
  • Matching words essentially - sorry for the confusion! – LHordley Apr 27 '20 at 13:53

2 Answers2

1

Is this what you are looking for?

lst <- regmatches(data$animal,gregexpr("\\w+",data$animal))

u <- lapply(seq_along(lst)[-length(lst)], 
            function(p) subset(data.frame(group_a = p,
                                          group_b = seq_along(lst)[-(1:p)],
                                          longestmatch = sapply(seq_along(lst)[-(1:p)], 
                                                                function(q) length(intersect(lst[[p]],lst[[q]])))),
                               longestmatch == max(longestmatch)))

res <- do.call(rbind,c(make.row.names = FALSE,u))

such that

> res
  group_a group_b longestmatch
1       1       2            3
2       2       3            2
3       2       4            2
4       3       4            2
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
1

Given your comment about wanting to match words, here is the start of a text approach that might be helpful. Basically we want to split out each word and count the occurrence in each statement.

library(tidytext)
library(dplyr)
library(tidyr)

dtm <- data %>%
  unnest_tokens("word", "animal", token = "regex", pattern = ",") %>% 
  mutate(word = str_trim(word)) %>%
  count(group, word) %>% 
  pivot_wider(names_from = "word", values_from = "n", values_fill = list(n = 0)) 

What you know have is a document term matrix. We have now changed your problem from a regex matching one to finding the vectors with the most matches.

# A tibble: 4 x 7
  group   cat   dog horse mouse   cow  frog
  <dbl> <int> <int> <int> <int> <int> <int>
1     1     1     1     1     1     0     0
2     2     1     1     1     0     0     0
3     3     1     1     0     0     0     0
4     4     1     1     0     0     1     1

An easy thing to do would be to extract the matrix part and just multiply.

mat <- as.matrix(select(dtm, -group))
matches <- (mat %*% t(mat))

This would give you a matrix for each group matches. For example, row 1, column 2 shows the three word matches (cat, dog, and horse) between groups 1 and 2.

matches
     [,1] [,2] [,3] [,4]
[1,]    4    3    2    2
[2,]    3    3    2    2
[3,]    2    2    2    2
[4,]    2    2    2    4

Then you can play with things from there. For example, pulling the row and column IDs and then the upper triangular part of the matrix can give you a summary. I think from here it is just a matter of how you want to filter the table.

data.frame(row = c(row(matches)),
           col = c(col(matches)),
           value = c(matches),
           upper = c(upper.tri(matches))) %>% 
  filter(upper == TRUE)

  row col value upper
1   1   2     3  TRUE
2   1   3     2  TRUE
3   2   3     2  TRUE
4   1   4     2  TRUE
5   2   4     2  TRUE
6   3   4     2  TRUE