
I want to calculate the Euclidean distances between the rows of a data frame with 30,000 observations. A simple way to do this is the dist function (e.g., dist(data)). However, since my data frame is large, this takes too much time.

Some of the rows contain missing values. I do not need the distances between pairs of rows where both rows contain missing values, nor between pairs where neither row contains missing values. I only need the distances between rows with missing values and complete rows.

In a for-loop, I tried to exclude the combinations that I do not need. Unfortunately, my solution takes even more time:

# Some example data
data <- data.frame(
  x1 = c(1, 22, NA, NA, 15, 7, 10, 8, NA, 5),
  x2 = c(11, 2, 7, 15, 1, 17, 11, 18, 5, 5),
  x3 = c(21, 5, 6, NA, 10, 22, 12, 2, 12, 3),
  x4 = c(13, NA, NA, 20, 12, 5, 1, 8, 7, 14)
)


# Measure speed of dist() function
start_time_dist <- Sys.time()

# Calculate euclidean distance with dist() function for complete dataset
dist_results <- dist(data)

end_time_dist <- Sys.time()
time_taken_dist <- end_time_dist - start_time_dist


# Measure speed of my own loop
start_time_own <- Sys.time()

# Calculate euclidean distance with my own loop only for specific cases

# # # 
# The following code should be faster!
# # # 

data_cc <- data[complete.cases(data), ]
data_miss <- data[complete.cases(data) == FALSE, ]

distance_list <- list()

for(i in 1:nrow(data_miss)) {

  distances <- numeric()
  for(j in 1:nrow(data_cc)) {
    distances <- c(distances, dist(rbind(data_miss[i, ], data_cc[j, ]), method = "euclidean"))
  }

  distance_list[[i]] <- distances
}

end_time_own <- Sys.time()
time_taken_own <- end_time_own - start_time_own


# Compare speed of both calculations
time_taken_dist # 0.002001047 secs
time_taken_own # 0.01562881 secs

Is there a faster way to calculate the Euclidean distances that I need?

Joachim Schork

1 Answer


I recommend using parallel computation: put all your code into one function and run it in parallel.

By default, R does all calculations in a single thread. You have to add parallel workers manually. Starting a cluster in R takes some time, but with a large data frame the main job will run roughly (number_of_processors - 1) times faster.

These links may also help: How-to go parallel in R – basics + tips and A gentle introduction to parallel computing in R.
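To illustrate the setup step itself, here is a minimal sketch using only the base parallel package (the Sys.getpid() call is only there to show that every worker is a separate R process):

library(parallel)

no_cores <- detectCores() - 1                    # leave one core free for the system

# starting the workers is the expensive one-time step
print(system.time(cl <- makeCluster(no_cores)))

# each worker is a separate R process with its own PID
worker_pids <- parLapply(cl, seq_len(no_cores), function(i) Sys.getpid())
print(unlist(worker_pids))

stopCluster(cl)                                  # always shut the workers down again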

A good approach is to divide your job into smaller packets and process them separately in each worker. Create the workers only once, because starting them is time-consuming in R.

library(parallel)
library(foreach)
library(doParallel)
# I am not sure that all required libraries are listed here;
# try ?your_function to find out which package a function comes from

# determine how many cores your computer has;
# one core should always stay free for the system
no_cores <- detectCores() - 1

start.t.total <- Sys.time()
print(start.t.total)

# start the parallel workers (create them only once)
cl <- makeCluster(no_cores, outfile = "mycalculation_debug.txt")
registerDoParallel(cl)

# the results of all workers will be collected in out.df (a data frame)
out.df <- foreach(p = 1:no_cores
                  , .combine = rbind   # rows from the different workers are bound into one table
                  , .packages = c()    # every package that your function uses must be listed here
                  , .inorder = TRUE    # keep the results in the order of the packets
                  ) %dopar% {          # don't forget this directive
                    tryCatch({
                      #
                      # enter your function here and do what you want in parallel;
                      # the value of the last expression is what the worker returns
                      #
                      print(Sys.time() - start.t.total)
                      print(paste("packet", p, "done"))
                      data.frame(packet = p)   # placeholder result
                    }, error = function(e) {
                      paste0("The packet '", p, "' caused the error: '", conditionMessage(e), "'")
                    })
                  }

stopCluster(cl)
gc()  # force R to free the memory of the killed worker processes
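Applied to the data from the question, a sketch could look like the following. The names cc_mat, miss_mat, chunks and dist_mat are only illustrative; the per-row distance is vectorised with sweep()/rowSums() and mimics the NA rescaling that dist() documents (the sum of squared differences is scaled up by the number of columns divided by the number of coordinates actually compared), so check it against your dist() results before relying on it:

library(parallel)
library(foreach)
library(doParallel)

# split the data as in the question
data_cc   <- data[complete.cases(data), ]
data_miss <- data[!complete.cases(data), ]
cc_mat    <- as.matrix(data_cc)
miss_mat  <- as.matrix(data_miss)

no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
registerDoParallel(cl)

# give each worker a block of the incomplete rows
chunks <- split(seq_len(nrow(miss_mat)),
                cut(seq_len(nrow(miss_mat)), no_cores, labels = FALSE))

dist_mat <- foreach(idx = chunks, .combine = rbind) %dopar% {
  t(sapply(idx, function(i) {
    d2      <- sweep(cc_mat, 2, miss_mat[i, ], "-")^2   # squared differences to all complete rows
    n_avail <- rowSums(!is.na(d2))                      # coordinates actually compared
    sqrt(rowSums(d2, na.rm = TRUE) * ncol(cc_mat) / n_avail)
  }))
}

stopCluster(cl)

# dist_mat[i, j] is the distance between the i-th incomplete row and the j-th complete row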
Alexander Borochkin
  • Thank you very much for your answer, that helps me a lot! I did not even know that this is possible with R and will try to implement your solution! – Joachim Schork Sep 24 '16 at 20:34
  • I think the `amap` package could be helpful here if you don't want to create your own functions; check out this [answer](http://stackoverflow.com/a/25767588/6327771) – Kasia Kulma Mar 20 '17 at 12:27
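Following up on the amap suggestion in the comment above: a minimal sketch, assuming Dist() accepts the data frame as-is and that you are fine with computing the full distance matrix first (I have not checked how Dist() treats missing values, so compare its output with dist() before relying on it):

library(amap)

# Dist() is a multi-threaded variant of dist(); nbproc sets the number of threads
d_all <- Dist(data, method = "euclidean", nbproc = 4)

# keep only the distances between incomplete rows and complete rows
d_mat    <- as.matrix(d_all)
miss     <- !complete.cases(data)
d_needed <- d_mat[miss, !miss, drop = FALSE]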