
I have a CSV file containing around 9000 number sequences that I need to cluster. The first six rows of the CSV look like this:

id, sequence
"1","1 2"
"2","3 4 5 5 6 6 7 8 9 10 11 12 13 8 14 10 10 15 11 12 16"
"3","17 18 19 20 5 5 20 5 5"
"4","20 21"
"5","22 4 23 24 25 26"

My R code that performs the clustering looks like this:

# seqsim returns the distance between two sequences:
# 1 - Jaccard similarity of their sets of 1-grams.
seqsim <- function(seq1, seq2) {
  seq1 <- as.character(seq1)
  seq2 <- as.character(seq2)
  s1 <- get1grams(seq1)   # get1grams (omitted here) tokenizes a sequence into its 1-grams
  s2 <- get1grams(seq2)
  intersection <- intersect(s1, s2)
  if (length(intersection) == 0) {
    return(1)
  } else {
    u <- union(s1, s2)
    score <- length(intersection) / length(u)
    return(1 - score)
  }
}
###############
library(usedist)  # dist_make() comes from the usedist package

mydata <- read.csv("sequence.csv")
mydatamatrix <- as.matrix(mydata$sequence)
rownames(mydatamatrix) <- mydata$id

# Build the pairwise distance matrix from the custom distance function
distance_matrix <- dist_make(mydatamatrix, seqsim, "SeqSim (custom)")

# Hierarchical clustering on the distance matrix
clusters <- hclust(distance_matrix, method = "complete")
plot(clusters)

# clusterCut contains the cluster ID assigned to each sequence (row) of the input dataset
clusterCut <- cutree(clusters, h = 0.5)

# Cross-tabulate sequence IDs against the assigned cluster IDs
table(mydata$id, clusterCut)
write.csv(clusterCut, file = "clusterIDs.csv")

The code works for a small number of sequences (around 900), but I run into memory issues with larger datasets.

My question is: am I doing the clustering the right way? Are there faster and more memory-efficient ways to handle clustering of this kind of data in R? Note that seqsim actually returns a distance, not a similarity, because it returns 1 - score. seqsim calls other helper functions, such as get1grams, which I have left out to keep the code short.
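
For anyone who wants to run the code: get1grams only needs to turn a sequence string into its set of 1-grams, so a minimal stand-in (not my actual implementation, which is omitted here) could look like this:

# Hypothetical stand-in for get1grams: split a sequence string on whitespace
# and return the unique tokens (1-grams)
get1grams <- function(s) {
  unique(strsplit(as.character(s), "\\s+")[[1]])
}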

Shamsa
  • Consider porting the code to a C module. Pure R is quite slow, so this can improve things considerably (and many good R packages are mostly written in Fortran, C, C++, etc.). – Has QUIT--Anony-Mousse Dec 19 '18 at 18:12
  • At its core, your distance function appears to compute the Jaccard distance. If you convert your data.frame of sequences to an incidence matrix, you can then use `dist`, `proxy::dist`, or some matrix operations. – emilliman5 Dec 19 '18 at 18:40

1 Answer


I suspect the bottleneck is the distance calculation and not the clustering per se.

Here is how I would approach this:

  1. Split the text processing from the distance calculation (this prevents you from processing each string multiple times).
  2. Use either R's dist function or matrix operations to calculate the distance matrix (here, the Jaccard distance).
  3. Be careful trying to plot the result of clustering 9000 sequences; the dendrogram will almost certainly be undecipherable.
  4. A 9000 x 9000 matrix of doubles takes roughly 9000 * 9000 * 8 bytes ≈ 650 MB (a dist object holds about half of that), so memory may well be the next bottleneck, depending on your computer's resources.

The code:

library(arules)
df <- read.table(text='id, sequence
"1","1 2"
"2","3 4 5 5 6 6 7 8 9 10 11 12 13 8 14 10 10 15 11 12 16"
"3","17 18 19 20 5 5 20 5 5"
"4","20 21"
"5","22 4 23 24 25 26"', header=TRUE, sep=",")

seq <- lapply(df$sequence, get1grams)  # assuming get1grams produces a character vector of 1-grams
names(seq) <- paste0("seq_", df$id)

seqTrans <- as(seq, "transactions")  # create a transactions object
seqMat <- as(seqTrans, "matrix")     # incidence matrix: rows = sequences, columns = 1-grams, cells = presence/absence
seqMat <- +(seqMat)                  # convert logical to 0/1
j.dist <- dist(seqMat, method = "binary")  # base R's "binary" method is the Jaccard distance

## Matrix multiplication to calculate the Jaccard distance
tseqMat <- t(seqMat)
a <- t(tseqMat) %*% tseqMat  # a[i, j] = number of 1-grams shared by sequences i and j (intersection size)
b <- t(matrix(rep(1, length(tseqMat)), nrow = nrow(tseqMat), ncol = ncol(tseqMat))) %*% tseqMat  # b[i, j] = total number of 1-grams in sequence j
b <- b - a                   # b[i, j] = 1-grams in j but not in i
c <- t(b)                    # c[i, j] = 1-grams in i but not in j
j <- as.dist(1 - a / (a + b + c))  # a + b + c = union size, so this is 1 - |intersection| / |union|

clusters <- hclust(j,  method = "complete")
plot(clusters)
clusterCut <- cutree(clusters, h=0.5)
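
If memory becomes the limiting factor, the same Jaccard calculation can also be done from a sparse incidence matrix built directly with the Matrix package. This is only a sketch (it assumes each element of seq is a vector of unique 1-grams, as above), and the final n x n union and distance matrices are still dense, so it mainly speeds up the cross-product rather than shrinking the result:

library(Matrix)

items <- sort(unique(unlist(seq)))
inc <- sparseMatrix(i = rep(seq_along(seq), lengths(seq)),   # row index = sequence
                    j = match(unlist(seq), items),           # column index = 1-gram
                    x = 1,
                    dims = c(length(seq), length(items)),
                    dimnames = list(names(seq), items))

a.sparse <- tcrossprod(inc)                                  # intersection sizes (sequences x sequences)
sizes <- rowSums(inc)                                        # number of 1-grams per sequence
u.sparse <- outer(sizes, sizes, "+") - as.matrix(a.sparse)   # union sizes (dense n x n)
j.sparse <- as.dist(1 - as.matrix(a.sparse) / u.sparse)      # Jaccard distance, should match j above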
emilliman5
  • Yes, the bottleneck is the distance matrix calculation. Using your code, 500 rows of data take 0.8 seconds for the dist function, 1000 rows take 12.3 seconds, and 3000 rows take 322.47 seconds. My dataset has almost 9000 rows and I need to be able to scale beyond that. Is there a way to use sparse matrices to speed up the calculation? – Shamsa Jan 01 '19 at 08:19
  • I used the parDist function and the distance matrix computation improved from 322 to 96 seconds for 3000 rows. – Shamsa Jan 01 '19 at 10:41
  • Using the matrix multiplication method is faster than using R's distance function (parDist is faster than dist but slower than the matrix method). – Shamsa Jan 02 '19 at 11:28
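
For reference, the parDist mentioned in the comments is parallelDist::parDist, a multi-threaded replacement for dist. A minimal usage sketch on the 0/1 incidence matrix from the answer (assuming the parallelDist package is installed):

library(parallelDist)

# "binary" here is the same Jaccard-style distance as dist(..., method = "binary"),
# computed with multiple threads; seqMat is the 0/1 incidence matrix from the answer
j.par <- parDist(seqMat, method = "binary")
clusters.par <- hclust(j.par, method = "complete")
clusterCut.par <- cutree(clusters.par, h = 0.5)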