0

I am trying to calculate jaccard coefficient for a two mode network data.

My data looks like this:

df <- data.frame(patent = c("A", "B", "B", "C", "C", "C"),
                 class = c("X", "Y", "Z", "X", "Y", "Z"))

node_list <- 
  df %>% 
  select(class) %>% distinct(class)

edge_list <- as.data.frame(t(combn(node_list,2)))
edge_list$no_patents_V1 <- NA
edge_list$no_patents_V2 <- NA
edge_list$no_patents_V1_V2 <- NA
edge_list$no_patents_V1_nV2 <- NA

I need to calculate edge weights. My edge weights are: I need to find how many patents belong to class 1 and class 2, class 1 but not 2, class 2 but not 1. Then I calculate jaccard coeff as a/a+b+c.

Also I need totals for how many patents belong to each of unique classes.

I tried following code:

`for(k in 1:nrow(edge_list)){
      
      edge_list[k,"no_patents_V1"] <-
        df%>% 
        filter(str_detect(classes, edge_list[k,1])) %>%
        nrow()
      
      edge_list[k,"no_patents_V2"] <-
        df%>% 
        filter(str_detect(classes, edge_list[k,2])) %>%
        nrow()
      
      edge_list[k,"no_patents_V1_V2"] <-
        df%>% 
        filter(str_detect(classes, edge_list[k,1])) %>%
        filter(str_detect(classes, edge_list[k,2])) %>%
        nrow()
      
      edge_list[k,"no_patents_V1_nV2"] <-
        df%>% 
        filter(str_detect(classes, edge_list[k,1])) %>%
        filter(!str_detect(classes, edge_list[k,2])) %>%
        nrow()
      
      edge_list[k,"no_patents_V2_nV1"] <-
        df%>% 
        filter(str_detect(classes, edge_list[k,2])) %>%
        filter(!str_detect(classes, edge_list[k,1])) %>%
        nrow()
    }
`

I have total 30 classes and hence 435 rows in edge list. This is super inefficient. Can you suggest some efficient way to solve this?

I have total of about one million patents.

1 Answers1

0

This might be what you are looking for, done in base R. Create data:

df <- data.frame(patent = c("A", "B", "B", "C", "C", "C"),
                 class = c("X", "Y", "Z", "X", "Y", "Z"))

node_list <- unique(df$class)
edge_list <- as.data.frame(t(combn(node_list, 2)))

Note that the original code returns an error when creating edge list. Loop over each row in edge_list:

for(i in 1:nrow(edge_list)) {
  V1 <- df[df$class == edge_list[i,]$V1,]$patent
  V2 <- df[df$class == edge_list[i,]$V2,]$patent
  edge_list$V1_and_V2[i] <- length(intersect(V1, V2))
  edge_list$V1_not_V2[i] <- length(setdiff(V1, V2))
  edge_list$V2_not_V1[i] <- length(setdiff(V2, V1))
}

We utilise set comparisons with intersect and setdiff to simplify the task, no need to look for strings. I normally don't like for loops, but it should be sufficient in this case. If not, you could put it into a sapply call. Output:

edge_list
V1 V2 V1_and_V2 V1_not_V2 V2_not_V1
X  Y         1         1         1
X  Z         1         1         1
Y  Z         2         0         0

Add-on: For really big datasets, we may parallelise with future.apply and wrap the code in a future_sapply call. Given a large data.frame of 5 million rows and 435 unique combinations of two classes:

df <- data.frame(patent = sample(1:1000, 5000000, replace = TRUE),
                 class = sample(1:30, 5000000, replace = TRUE))
node_list <- unique(df$class)
edge_list <- as.data.frame(t(combn(node_list, 2)))

Using future_sapply:

library(future.apply)
plan(multisession, workers = 10)
proc <- future_sapply(1:nrow(edge_list), function(x) {
  V1 <- df[df$class == edge_list$V1[x],]$patent
  V2 <- df[df$class == edge_list$V2[x],]$patent
  V1_and_V2 <- length(intersect(V1, V2))
  V1_not_V2 <- length(setdiff(V1, V2))
  V2_not_V1 <- length(setdiff(V2, V1))
  return(c(V1_and_V2, V1_not_V2, V2_not_V1))
}, future.seed = TRUE)
edge_list <- cbind(edge_list, t(proc))
names(edge_list)[3:5] <- c("V1_and_V2", "V1_not_V2", "V2_not_V1")

Microbenchmark:

Unit: seconds
          expr       min        lq      mean    median        uq       max neval cld
      for_loop 19.003180 20.485825 20.582161 20.734163 20.943270 21.081319    10   b
 future_sapply  4.116312  4.216228  4.259608  4.265985  4.327765  4.368124    10  a 
thesixmax
  • 164
  • 1
  • 9
  • Thanks so much. This looks more efficient than my code. I am not sure if this can by done using apply or lapply. Then for loop could be avoided and hence speed would greatly improve. – Puneet Sachdeva Mar 01 '23 at 01:17
  • Thank you so much. This significantly reduced the runtime. If I find some way to vectorize the command, I will share in this thread. I think vectorization is possible, just maybe we need to do it in multiple steps or pre-process the df data frame. I have 30 classes so the number of rows in edge_list is 435. And I have a list of 4000 dfs on which I need to run this. – Puneet Sachdeva Mar 01 '23 at 04:12