
I am running R 3.2.3 on a machine with 16 GB RAM. I have a large matrix of 300,000 rows x 12 columns. I want to use a hierarchical clustering algorithm in R, so before I do that, I am trying to create a distance matrix. Since the data is of mixed type, I use different distance measures for the different column types. I get an error about memory allocation:

library(geosphere)   # distm()
library(e1071)       # hamming.distance()
library(magrittr)    # %>%

df1 <- as.data.frame(matrix(rnorm(36*10^5), nrow = 3*10^5))
d1 <- as.dist(distm(df1[, 1:2]) / 10^5)
d2 <- dist(df1[, 3:8], method = "euclidean")
d3 <- hamming.distance(df1[, 9:12] %>% as.matrix()) %>% as.dist()

I get the following errors:

> d1=as.dist(distm(df1[,c(1:2)])/10^5)
Error: cannot allocate vector of size 670.6 Gb
In addition: Warning messages:
1: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
2: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
3: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
4: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
> d2=dist(df1[,c(3:8)], method = "euclidean") 
Error: cannot allocate vector of size 335.3 Gb
In addition: Warning messages:
1: In dist(df1[, c(3:8)], method = "euclidean") :
 Reached total allocation of 16070Mb: see help(memory.size)
2: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
3: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
4: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
> d3= hamming.distance(df1[,c(9:12)]%>%as.matrix(.))%>%as.dist(.)
Error: cannot allocate vector of size 670.6 Gb
In addition: Warning messages:
1: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
2: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
3: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
4: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
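
As a sanity check, these sizes are consistent with what the dense results would need for n = 3*10^5 rows: the warnings show that distm() and hamming.distance() allocate a full n x n double matrix, and dist() stores the lower triangle. A rough calculation, assuming 8 bytes per double:

n <- 3 * 10^5
n * n * 8 / 2^30            # full n x n double matrix: ~670.6 GiB
n * (n - 1) / 2 * 8 / 2^30  # lower triangle stored by dist(): ~335.3 GiB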
  • You don't need to process all of your data together, which will consume all of your memory and error out. Consider handling it batch by batch, e.g. 10,000 vectors at a time. – Patric Dec 15 '15 at 05:21
  • But in clustering, we need to calculate the distance from one row to all other rows. So how would computing in batches help here? – Kanika Singhal Dec 15 '15 at 05:23
  • Yes, but you can do a final reduction to select the min/max one. Does this make sense? For efficiently computing distances, you can refer [here](http://stackoverflow.com/questions/27847196/distance-calculation-on-large-vectors-performance/33409695#33409695). – Patric Dec 15 '15 at 05:25
  • Reduction by selecting min/max? Sorry, I didn't understand. A more detailed explanation would help. – Kanika Singhal Dec 15 '15 at 05:30
  • I've added an answer because I needed a few more words to write it up; is it clear to you? Thanks. – Patric Dec 15 '15 at 05:49
  • @Patric, in that case we are not using any built-in function for clustering; we are making our own. In your answer, do you mean defining the centroids of the clusters and then finding out which row belongs to which cluster? – Kanika Singhal Dec 16 '15 at 05:36
  • Yes, that's not a built-in function, but you can still use `dist`. Alternatively, a memory-swapping library might help you swap data between RAM and disk, but performance will be slow. For big datasets, most of the time we have to use tricks like this, even if it's a little ugly. – Patric Dec 16 '15 at 05:41
  • Okay, thank you. I will try to implement this idea. – Kanika Singhal Dec 16 '15 at 05:58

1 Answer


To simplify, assume you have a single row (A) to cluster against a very large matrix (B), say 3*10^8 rows, by minimum distance.

The original approach is:

1. load A and B
2. compute the distance between A and each row of B
3. select the smallest one from the results (reduction)

But because B is really large, you can't load it into memory, or you error out during the computation.

The batched approach looks like this (a minimal R sketch follows the list):

1. load A (suppose it is small)
2. load B.partial, containing rows 1 to 10^5 of B
3. compute the distance of A to each row of B.partial
4. select the minimum from the partial results and save it as res[i]
5. go back to 2) and load the next 10^5 rows of B
6. finally you have 3000 partial results saved in res[1:3000]
7. reduction: select the minimum from res[1:3000]
   note: if you need all the distances, as the `dist` function returns, you don't need the reduction; just keep the whole array.
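
Here is a minimal sketch of the idea in R, assuming Euclidean distance and a B that is already in memory (in a real case you would read each chunk from disk instead). The names `batched_min_dist` and `batch_size` are just illustrative, not from any package:

batched_min_dist <- function(A, B, batch_size = 10^5) {
  n      <- nrow(B)
  starts <- seq(1, n, by = batch_size)
  res     <- numeric(length(starts))   # partial minimum distance per chunk
  res_idx <- integer(length(starts))   # row of B achieving each partial minimum

  for (i in seq_along(starts)) {
    rows   <- starts[i]:min(starts[i] + batch_size - 1, n)
    B_part <- B[rows, , drop = FALSE]                 # step 2: one chunk of B
    d      <- sqrt(rowSums(sweep(B_part, 2, A)^2))    # step 3: distance of A to each row
    res[i]     <- min(d)                              # step 4: partial reduction
    res_idx[i] <- rows[which.min(d)]
  }

  best <- which.min(res)                              # step 7: final reduction
  list(index = res_idx[best], distance = res[best])
}

# usage on simulated data
set.seed(1)
B <- matrix(rnorm(3 * 10^5 * 6), ncol = 6)
A <- rnorm(6)
batched_min_dist(A, B)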

The code will be a little more complicated than the original, but this is a very common trick when dealing with big-data problems. For the compute part, you can refer to one of my previous answers here.

It would be much appreciated if you could paste your final batched code here so that others can learn from it as well.


Another interesting thing about `dist` is that it is one of the few functions in R that supports OpenMP. See the source code here and how to compile R with OpenMP here.

So, if you set OMP_NUM_THREADS to 4 or 8 depending on your machine and run again, you should see a significant performance improvement!

void R_distance(double *x, int *nr, int *nc, double *d, int *diag,
                int *method, double *p)
{
    int dc, i, j;
    size_t  ij;  /* can exceed 2^31 - 1 */
    double (*distfun)(double*, int, int, int, int) = NULL;
#ifdef _OPENMP
    int nthreads;
#endif
    .....
}
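
A minimal way to experiment with this (a sketch, assuming your R build was compiled with OpenMP; the subsample size is illustrative):

# set the thread count in the shell *before* starting R, e.g.
#   OMP_NUM_THREADS=4 R --vanilla
# then time dist() on a subsample small enough to fit in memory
# (2*10^4 rows already needs roughly 1.5 GB for the result):
x <- matrix(rnorm(2 * 10^4 * 6), ncol = 6)
system.time(d_sub <- dist(x, method = "euclidean"))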

Furthermore, if you want to accelerate `dist` on a GPU, you can refer to the talk section on ParallelR.
