I have N vectors containing the cumulative frequencies of tweets; for clarification, one of these vectors would look like (0, 0, 1, 1, 2, 3, 4, 4, 5, 5, 6, 6, ...).
I wanted to visualize the differences in these frequencies by creating a heat map. For that I first wanted to create an NxN matrix containing the euclidean distances between tweets. My first approach is rather Java-like and looks like this:
create_dist <- function(x){
  n <- length(x)               #number of tweets
  xy <- matrix(nrow=n, ncol=n) #create NxN matrix
  colnames(xy) <- names(x)     #set column
  rownames(xy) <- names(x)     #and row names
  for(i in 1:n) {
    for(j in 1:n){
      xy[i,j] <- distance(x[[i]], x[[j]]) #calculate euclidean distance for now, but should be interchangeable
    }
  }
  xy
}
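(For reference, distance is a small helper function of my own that isn't shown above; as the comment says, for now it just computes the plain euclidean distance between two of these cumulative-frequency vectors, roughly like this:)

#helper: euclidean distance between two numeric vectors of equal length
distance <- function(a, b){
  sqrt(sum((a - b)^2))
}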
I measured the time it takes to create this distance matrix, and for a small sample (around two thousand tweets) it already takes about 35 seconds.
> system.time(create_dist(cumFreqs))
user system elapsed
34.572 0.000 34.602
Now I thought about how I could speed up the calculation a bit, and since my computer has 8 cores I figured parallelization might make it faster.
Like the R novice I am, I changed the inner for loop to a foreach loop.
#libraries
library(foreach)
library(doMC)
registerDoMC(4)

create_dist <- function(x){
  n <- length(x)               #number of tweets
  xy <- matrix(nrow=n, ncol=n) #create NxN matrix
  colnames(xy) <- names(x)     #set column
  rownames(xy) <- names(x)     #and row names
  for(i in 1:n) {
    xy[i,] <- unlist(foreach(j=1:n) %dopar% { #set each row of the matrix
      distance(x[[i]], x[[j]])
    })
  }
  xy
}
Again I wanted to measure the time it takes to create a distance matrix for a sample of two thousand tweets using system.time(), but I cancelled the execution after 10 minutes because there obviously isn't any speedup at all.
I googled for solutions, but unfortunately I haven't found any. Now I wanted to ask you if there is a better way to create this distance matrix, maybe using one of the apply functions, which I have no shame in admitting still confuse me.
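For example, I imagine it would look something like the sketch below (create_dist_apply is just a name I made up, and I have no idea whether this is idiomatic or any faster than the loops):

#same matrix built with nested sapply calls instead of explicit for loops
create_dist_apply <- function(x){
  n <- length(x)
  #outer sapply builds column j, inner sapply fills rows i of that column
  xy <- sapply(1:n, function(j) sapply(1:n, function(i) distance(x[[i]], x[[j]])))
  dimnames(xy) <- list(names(x), names(x))
  xy
}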