
My question may be worded confusingly, so I'll clarify. Let's say I have two datasets. The first one (DS1) is made up of 10 (x,y) coordinates. The second one (DS2) is made up of 20 (x,y) coordinates.

My goal is to find which point in DS2 is closest to each point of DS1. So I would end up with, in this example, 10 distances.
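
To make the goal concrete, here's a toy illustration (just for this question; it assumes both sets are stored as two-column numeric matrices, one (x,y) point per row):

set.seed(1)
DS1 <- matrix(runif(10 * 2), ncol = 2)  # 10 points
DS2 <- matrix(runif(20 * 2), ncol = 2)  # 20 points

# For the first point of DS1, the distance to its closest point in DS2:
p <- DS1[1, ]
min(sqrt((DS2[, 1] - p[1])^2 + (DS2[, 2] - p[2])^2))

Repeating that for every row of DS1 gives the 10 desired distances.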

BTW, I already wrote a working function that does this. But it's SLOW. I did a brute-force method with two nested for loops. Is there an established algorithm or package that does this faster?

EDIT: People have asked to see my code. I apologize in advance to those of you who have taken a formal algorithms class.


Generate_distances_list <- function(standard_object, experimental_expression, dist.method = "Euclidean") {
  
  #Number of reference (standard) points and experimental points/columns
  nrow_df <- nrow(standard_object$expression_ref)
  nrow_experimental_expr <- nrow(experimental_expression)
  ncol_experimental_expr <- ncol(experimental_expression)
  
  #Initialize distance_df: one row per experimental point,
  #column 1 holds the distance, column 2 the experimental row index
  ncol_df <- 2
  distance_df <- data.frame(matrix(ncol = ncol_df, nrow = nrow_experimental_expr))
  distance_df[, ncol_df] <- 1:nrow_experimental_expr
  
  #Initialize list of distance data frames, one per reference point
  distance_df_list <- vector(mode = "list", length = nrow_df)
  
  for (i in 1:nrow_df) {
    
    for (j in 1:nrow_experimental_expr) {
      
      if (dist.method == "Euclidean") {
        
        #Euclidean distance between reference point i and experimental point j
        distance_df[j, 1] <- dist(rbind(standard_object$expression_ref[i, ],
                                        experimental_expression[j, ]))[1]
        
      } else if (dist.method == "Manhattan") {
        
        #Manhattan distance: sum of absolute coordinate-wise differences
        vec <- vector(length = ncol_experimental_expr)
        
        for (k in 1:ncol_experimental_expr) {
          vec[k] <- abs(standard_object$expression_ref[i, k] - experimental_expression[j, k])
        }
        
        distance_df[j, 1] <- sum(vec)
      }
    }
    distance_df_list[[i]] <- distance_df
  }
  
  distance_df_list
}
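
For reference, a minimal sketch of the kind of packaged approach being asked about: the RANN package does nearest-neighbour search with a k-d tree via nn2(). This is only a sketch under the assumption that the data can be passed as plain numeric matrices (one point per row); it is not the function above.

# install.packages("RANN")
library(RANN)

set.seed(1)
DS1 <- matrix(runif(10 * 2), ncol = 2)  # 10 query points
DS2 <- matrix(runif(20 * 2), ncol = 2)  # 20 reference points

# For each row of DS1, find its single nearest row in DS2 (Euclidean distance)
res <- nn2(data = DS2, query = DS1, k = 1)

res$nn.idx[, 1]    # index in DS2 of the closest point to each DS1 point
res$nn.dists[, 1]  # the 10 corresponding distances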
  

Nova
  • How are your data sets stored? Numeric vectors? `data.frame`? `data.table`? It seems like some combination of `vapply` and `min` would be pretty fast. Maybe you could post your slow function? – Eric Canton Aug 12 '20 at 20:20
  • Even with loops, I'm surprised that 10*20 = 200 calculations can be slow. Why do you need 3 nested loops, aren't 2 enough? – Waldi Aug 12 '20 at 20:33
  • `library(tidyverse); df1 <- tibble(id1 = 1:10, x1 = runif(10, -30, 30), y1 = runif(10, -30, 30)); df2 <- tibble(id2 = 1:20, x2 = runif(20, -30, 30), y2 = runif(20, -30, 30)); df <- crossing(df1, df2) %>% mutate(dist = ((x1 - x2)^2 + (y1 - y2)^2)^(1/2)) %>% group_by(id1) %>% slice_min(dist)` – Jakub.Novotny Aug 12 '20 at 20:47 (reformatted below)
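
The snippet from the last comment, reformatted for readability (the logic is unchanged: build all 10 × 20 pairs with crossing(), then keep the minimum distance per DS1 point):

library(tidyverse)

df1 <- tibble(id1 = 1:10, x1 = runif(10, -30, 30), y1 = runif(10, -30, 30))
df2 <- tibble(id2 = 1:20, x2 = runif(20, -30, 30), y2 = runif(20, -30, 30))

df <- crossing(df1, df2) %>%
  mutate(dist = ((x1 - x2)^2 + (y1 - y2)^2)^(1/2)) %>%
  group_by(id1) %>%
  slice_min(dist)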

0 Answers