I am trying to calculate the Jaccard coefficient for two-mode network data.
My data looks like this:
library(dplyr)
library(stringr)

df <- data.frame(patent = c("A", "B", "B", "C", "C", "C"),
                 class = c("X", "Y", "Z", "X", "Y", "Z"))
# one row per unique class
node_list <- df %>%
  distinct(class)
# combn() needs a vector, so pass the class column rather than the data frame
edge_list <- as.data.frame(t(combn(node_list$class, 2)))
edge_list$no_patents_V1 <- NA
edge_list$no_patents_V2 <- NA
edge_list$no_patents_V1_V2 <- NA
edge_list$no_patents_V1_nV2 <- NA
edge_list$no_patents_V2_nV1 <- NA
I need to calculate edge weights. For each pair of classes, I need to count how many patents belong to both class 1 and class 2 (a), to class 1 but not class 2 (b), and to class 2 but not class 1 (c). The Jaccard coefficient is then a / (a + b + c).
I also need the total number of patents that belong to each unique class.
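To make these quantities concrete, here is a minimal sketch on the sample `df` above. It is only one possible way to get the counts, using a patent-by-class incidence matrix and `crossprod()`, and I have not checked how it behaves at full scale. The names `inc`, `both`, `totals`, `only_V1`, `only_V2` and `jaccard` are made up for this illustration.

```r
# Sample data from above: one row per (patent, class) membership.
df <- data.frame(patent = c("A", "B", "B", "C", "C", "C"),
                 class  = c("X", "Y", "Z", "X", "Y", "Z"))

# Patent-by-class incidence matrix (1 = patent belongs to class).
# unique() guards against duplicated rows inflating the counts.
inc <- unclass(table(unique(df)))

# Class-by-class matrix: entry [i, j] = number of patents in both i and j.
both <- crossprod(inc)

# Number of patents in each class.
totals <- colSums(inc)

# All unordered pairs of classes, one pair per row.
pairs <- t(combn(colnames(inc), 2))

a       <- both[pairs]               # in V1 and V2
only_V1 <- totals[pairs[, 1]] - a    # in V1 but not V2
only_V2 <- totals[pairs[, 2]] - a    # in V2 but not V1

edge_list <- data.frame(
  V1                = pairs[, 1],
  V2                = pairs[, 2],
  no_patents_V1     = unname(totals[pairs[, 1]]),
  no_patents_V2     = unname(totals[pairs[, 2]]),
  no_patents_V1_V2  = unname(a),
  no_patents_V1_nV2 = unname(only_V1),
  no_patents_V2_nV1 = unname(only_V2),
  jaccard           = unname(a / (a + only_V1 + only_V2))
)
print(edge_list)
```

On the sample data, X = {A, C} and Y = {B, C} share only patent C out of three patents in their union, so the X-Y coefficient is 1/3, while Y and Z contain the same patents and get 1.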
I tried the following code:
`for (k in seq_len(nrow(edge_list))) {
  # classes: column assumed to hold all classes of a patent as one string
  # number of patents in the first class of the pair
  edge_list[k, "no_patents_V1"] <- df %>%
    filter(str_detect(classes, edge_list[k, 1])) %>%
    nrow()
  # number of patents in the second class of the pair
  edge_list[k, "no_patents_V2"] <- df %>%
    filter(str_detect(classes, edge_list[k, 2])) %>%
    nrow()
  # patents in both classes
  edge_list[k, "no_patents_V1_V2"] <- df %>%
    filter(str_detect(classes, edge_list[k, 1])) %>%
    filter(str_detect(classes, edge_list[k, 2])) %>%
    nrow()
  # patents in the first class but not the second
  edge_list[k, "no_patents_V1_nV2"] <- df %>%
    filter(str_detect(classes, edge_list[k, 1])) %>%
    filter(!str_detect(classes, edge_list[k, 2])) %>%
    nrow()
  # patents in the second class but not the first
  edge_list[k, "no_patents_V2_nV1"] <- df %>%
    filter(str_detect(classes, edge_list[k, 2])) %>%
    filter(!str_detect(classes, edge_list[k, 1])) %>%
    nrow()
}
`
I have 30 classes in total, and hence 435 rows in the edge list, and about one million patents. The loop above is very inefficient at that scale. Can you suggest a more efficient way to do this?