
I have N vectors containing the cumulative frequencies of tweets. For clarification, one of these vectors would look like (0, 0, 1, 1, 2, 3, 4, 4, 5, 5, 6, 6, ...).

I wanted to visualize the differences in these frequencies by creating a heat map. For that I first wanted to create an NxN matrix containing the euclidean distances between tweets. My first approach is rather Java-like and looks like this:

create_dist <- function(x){
  n <- length(x)                             #number of tweets
  xy <- matrix(nrow=n, ncol=n)               #create NxN matrix
  colnames(xy) <- names(x)                   #set column
  rownames(xy) <- names(x)                   #and row names

  for(i in 1:n) {
    for(j in 1:n){
      xy[i,j] <- distance(x[[i]], x[[j]])    #euclidean distance for now, but should be interchangeable
    }
  }

  xy
}
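
The distance function is just the plain euclidean distance for now, something like this (a minimal sketch; the real function should stay swappable):

distance <- function(a, b) {    #euclidean distance between two
  sqrt(sum((a - b)^2))          #equal-length numeric vectors
}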

I measured the time it takes to create this distance matrix, and for a small sample (around two thousand tweets) it already takes about 35 seconds.

> system.time(create_dist(cumFreqs))
user  system elapsed 
34.572   0.000  34.602 

Now I thought about how I could speed up the calculation a little. Because my computer has 8 cores, I thought using parallelization might make it faster.

Like the R novice I am, I changed the inner for loop to a foreach loop.

#libraries
library(foreach)
library(doMC)
registerDoMC(4)

create_dist <- function(x){
  n <- length(x)                                #number of tweets
  xy <- matrix(nrow=n, ncol=n)                  #create NxN matrix
  colnames(xy) <- names(x)                      #set column
  rownames(xy) <- names(x)                      #and row names

  for(i in 1:n) {
    xy[i,] <- unlist(foreach(j=1:n) %dopar% {   #set each row of the matrix
      distance(x[[i]], x[[j]])
    })
  }

  xy
}

Again I wanted to measure the time it takes to create a distance matrix for a sample of two thousand tweets using system.time(), but I cancelled the execution after 10 minutes because there obviously wasn't any speed-up at all.

I googled for solutions, but unfortunately I haven't found any. Now I wanted to ask you if there is a better way to create this distance matrix, maybe an apply function, which I have no shame admitting still confuses me.

  • Why don't you use `?dist`? It should be a lot faster than your solution. – sgibb Jun 16 '13 at 12:19
  • I believe you would get better performance if you parallelized the outer loop and not the inner loop (see the sketch after these comments). To get a benefit, even though there is parallelization overhead, each iteration needs to be performance intensive. However, I believe you can get rid of all explicit R loops in your code (see comment by @sgibb). – Roland Jun 16 '13 at 12:21
  • Or, you could write the distance calculation in C++, and incorporate it into R using the `inline` package. – Paul Hiemstra Jun 16 '13 at 12:22
  • I thought about using dist too, but the distance function I use should be interchangeable later. –  Jun 16 '13 at 12:24
  • Maybe you want to have a look at the [proxy](http://cran.r-project.org/web/packages/proxy/index.html) package. It supports 48 different distance measurements. The calculation is based on matrices and mostly very fast. – sgibb Jun 16 '13 at 13:07
  • It's unclear to me what object `x` (e.g. `cumFreqs`) looks like. You say you have `N` vectors, but your first function calculates distance between 2 vectors (i.e. between vector `x` and `x`). Is that correct? And what would your object, that is a sample of 2000 tweets, look like? (a vector of length 2000?) – jbaums Jun 16 '13 at 13:55
  • class(cumFreqs) tells me that it's a list. cumFreqs[[1]] would be an integer vector of length 100 containing the frequencies for the first tweet, cumFreqs[[2]] would be a vector of length 100 containing the frequencies for the second tweet, etc. until cumFreqs[[2000]] –  Jun 16 '13 at 14:09
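
A minimal sketch of Roland's suggestion, parallelizing the outer loop instead of the inner one (this assumes the same doMC setup and `distance` function as in the question; `create_dist_par` is just an illustrative name):

create_dist_par <- function(x){
  n <- length(x)
  xy <- foreach(i = 1:n, .combine = rbind) %dopar% {    #one matrix row per iteration
    sapply(1:n, function(j) distance(x[[i]], x[[j]]))
  }
  dimnames(xy) <- list(names(x), names(x))               #keep tweet names on both axes
  xy
}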

2 Answers


As mentioned, you can use the dist function. Here is an example of how to use the result of dist to create a heat map.

nn <- paste0('row',1:5)
x <- matrix(rnorm(25), nrow = 5,dimnames=list(nn))
distObj <- dist(x)
cols <- c("#D33F6A", "#D95260", "#DE6355", "#E27449", 
            "#E6833D", "#E89331", "#E9A229", "#EAB12A", "#E9C037", 
            "#E7CE4C", "#E4DC68", "#E2E6BD")
## mandatory coercion
distObj <- as.matrix(distObj)
## heatmap
image(distObj[order(nn), order(nn)], col = cols, 
      xaxt = "n", yaxt = "n")
## axes labels
axis(1, at = seq(0, 1, length.out = dim(distObj)[1]), labels = nn, 
     las = 2)
axis(2, at = seq(0, 1, length.out = dim(distObj)[1]), labels = nn, 
     las = 2)

[Heat map of the distance matrix produced by the code above]

– agstudy
  • So, using your `cumFreqs` list of vectors, you could do: `x <- do.call(rbind, cumFreqs)`, followed by `distObj <- dist(x)`. With 2000 vectors of length 100, this takes just a couple of seconds. – jbaums Jun 16 '13 at 14:59
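
Put together, the suggestion in the comment above might look like this (a sketch, assuming cumFreqs is the list of equal-length numeric vectors described in the question comments):

x <- do.call(rbind, cumFreqs)            #one row per tweet
distObj <- as.matrix(dist(x))            #euclidean distance matrix
image(distObj, col = heat.colors(12),    #quick heat map of the distances
      xaxt = "n", yaxt = "n")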

Like 'agstudy' suggests, use the built-in 'dist' function.

For future reference, nested for loops in R are pretty slow. As R is a functional language, try to use vectorised operations with functions such as the apply family (apply, lapply, sapply, tapply). It takes some time to think about programming tasks in a functional way when you're used to a C-like paradigm.
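
For example, the nested loops from the question could be written with nested sapply calls (a sketch, still assuming a two-argument `distance` function; this mainly tidies the code rather than speeding it up):

create_dist <- function(x){
  #outer sapply builds one column per tweet, inner sapply one entry per tweet
  xy <- sapply(x, function(a) sapply(x, function(b) distance(a, b)))
  dimnames(xy) <- list(names(x), names(x))
  xy
}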

A useful discussion on benchmarks between for loops and apply flavours is here: Is R's apply family more than syntactic sugar?

– J450n