I have a CSV file containing around 9000 number sequences that I need to cluster. The first six rows of the file look like this:
id, sequence
"1","1 2"
"2","3 4 5 5 6 6 7 8 9 10 11 12 13 8 14 10 10 15 11 12 16"
"3","17 18 19 20 5 5 20 5 5"
"4","20 21"
"5","22 4 23 24 25 26"
My R code that performs the clustering looks like this:
# Jaccard distance between two sequences, computed on their sets of 1-grams
seqsim <- function(seq1, seq2) {
  seq1 <- as.character(seq1)
  seq2 <- as.character(seq2)
  s1 <- get1grams(seq1)
  s2 <- get1grams(seq2)
  intersection <- intersect(s1, s2)
  if (length(intersection) == 0) {
    return(1)  # nothing in common: maximum distance
  } else {
    u <- union(s1, s2)
    score <- length(intersection) / length(u)  # Jaccard similarity
    return(1 - score)                          # convert similarity to distance
  }
}
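seqsim calls a couple of other helpers that I have omitted; the one that matters here is get1grams, which essentially returns the set of space-separated tokens of a sequence. My real implementation is longer, but a minimal stand-in that should behave equivalently on this data looks like this:

get1grams <- function(seq) {
  # split on whitespace and keep each distinct token (1-gram) once
  unique(strsplit(trimws(seq), "\\s+")[[1]])
}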
###############
library(usedist)  # dist_make() comes from the usedist package

mydata <- read.csv("sequence.csv", stringsAsFactors = FALSE)

# put the sequences into a one-column matrix labelled by id, as dist_make expects
mydatamatrix <- as.matrix(mydata$sequence)
rownames(mydatamatrix) <- mydata$id

# build the full pairwise distance matrix with the custom distance function
distance_matrix <- dist_make(mydatamatrix, seqsim, "SeqSim (custom)")

# agglomerative hierarchical clustering with complete linkage
clusters <- hclust(distance_matrix, method = "complete")
plot(clusters)

# cut the dendrogram at height 0.5; clusterCut maps each row of the input to a cluster ID
clusterCut <- cutree(clusters, h = 0.5)

# cross-tabulate ids against clusters (table(clusterCut) alone gives the member count per cluster)
table(mydata$id, clusterCut)
write.csv(clusterCut, file = "clusterIDs.csv")
The code works for a small number of sequences (around 900), but I run into memory problems on larger datasets.
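I suspect the sheer scale is the issue: for n = 9000 sequences, dist_make has to call seqsim once per pair, and the resulting dist object alone holds n(n-1)/2 doubles (and if dist_make allocates a full 9000 x 9000 matrix internally before converting to dist, the peak footprint roughly doubles):

n <- 9000
n * (n - 1) / 2               # 40,495,500 pairs, i.e. 40.5 million seqsim calls
n * (n - 1) / 2 * 8 / 1024^2  # ~309 MB just to store the dist object as doubles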
My question is: am I doing the clustering the right way? Are there faster and more memory-efficient ways to handle clustering of this kind of data in R? Note that seqsim actually returns a distance rather than a similarity, because it returns 1 - score. seqsim also calls other methods that I have left out to keep the code short (the minimal get1grams above is a stand-in for one of them).
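For example, applying seqsim to rows 3 and 4 of the sample (with the stand-in get1grams) gives:

seqsim("17 18 19 20 5 5 20 5 5", "20 21")
# s1 = {17, 18, 19, 20, 5}, s2 = {20, 21}
# intersection = {20}, union = {17, 18, 19, 20, 5, 21}
# so the result is 1 - 1/6 ≈ 0.833, a distance, not a similarity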