R - How to speed up Euclidean distance calculation on a very large dataset

Question

community,

I have a very large dataset with 3 coordinate columns (x, y, z) and 24 x 10^6 rows. I need to calculate the Euclidean distance between every row and the first row, which is (0, 0, 0). With the loop below this takes a very long time. I have also tried this on a matrix instead of a data frame, but that did not solve the problem.

Does anyone have suggestions to speed up this process?

library(cluster)

e <- list() # list to be filled with euclidean distances

for (r in 1:(nrow(pca.123.df))) {

  eucl.dist <- daisy(pca.123.df[c(1,r), ], metric = "euclidean") # Euclidean distance between anomaly and zero (row 1)

  e[[r]] <- eucl.dist[1]

}

score 4 · Answer 1 · answered Nov 07 '14 at 09:02

Use the formula for the Euclidean distance directly instead of calling daisy() once per row.

A reproducible example of your code:

library(cluster)
set.seed(42)
DF <- as.data.frame(rbind(0, matrix(rnorm(15), ncol=3))) 

e <- list() # list to be filled with euclidean distances

for (r in 1:(nrow(DF))) {

  eucl.dist <- daisy(DF[c(1,r), ], metric = "euclidean") # Euclidean distance between anomaly and zero (row 1)

  e[[r]] <- eucl.dist[1]

}
# [[1]]
# [1] 0
# 
# [[2]]
# [1] 1.895646
# 
# [[3]]
# [1] 2.79863
# 
# [[4]]
# [1] 1.438665
# 
# [[5]]
# [1] 2.133606
# 
# [[6]]
# [1] 0.4302796

A vectorized solution:

sqrt(colSums((t(DF)-unlist(DF[1,]))^2))
#[1] 0.0000000 1.8956461 2.7986300 1.4386649 2.1336055 0.4302796
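
The one-liner above works because t(DF) has one column per point, and R recycles the length-3 reference vector down each column. As a sketch of the same idea without the transpose, sweep() can subtract the reference row column by column. The names m, ref, d_sweep and d_t are only illustrative and do not come from the question:

```r
set.seed(42)
DF <- as.data.frame(rbind(0, matrix(rnorm(15), ncol = 3)))

m   <- as.matrix(DF)  # numeric matrix, one row per point
ref <- m[1, ]         # reference point (row 1)

# sweep() subtracts ref[j] from every element of column j (default FUN is "-")
d_sweep <- sqrt(rowSums(sweep(m, 2, ref)^2))

# the transpose-based version from above, for comparison
d_t <- sqrt(colSums((t(DF) - unlist(DF[1, ]))^2))

all.equal(unname(d_sweep), unname(d_t))
```

Both forms should give the same distances. Unlike the shortcut that relies on a zero first row, this one also works when the reference row is not all zeros.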

Using the knowledge that the first row is all zeros:

sqrt(rowSums(DF^2))
#[1] 0.0000000 1.8956461 2.7986300 1.4386649 2.1336055 0.4302796
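
For data the size described in the question, it probably helps to keep the coordinates in a numeric matrix, since rowSums() converts a data frame to a matrix internally anyway. The following is a rough sketch on simulated data. The size n and the names pts, d, idx and check are assumptions for illustration, and n is kept far below 24 x 10^6 so it runs quickly:

```r
set.seed(1)
n   <- 1e5                                        # stand-in size, not the real row count
pts <- rbind(0, matrix(rnorm(3 * n), ncol = 3))   # row 1 is the origin

# distance of every row from the origin, computed in one vectorized pass
d <- sqrt(rowSums(pts^2))

# spot-check a few rows against the definition of the Euclidean distance
idx   <- c(1, 2, n + 1)
check <- sapply(idx, function(i) sqrt(sum((pts[i, ] - pts[1, ])^2)))
all.equal(d[idx], check)
```

A 24 x 10^6 by 3 matrix of doubles takes roughly 0.58 GB, and pts^2 creates a temporary copy of the same size, so memory rather than CPU time may become the limit at full scale.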

Thanks, efficient solution! – Niels Raes Nov 07 '14 at 09:39
