My dataset looks like the example below, but I have around 5*10^5 sequences (strings) to compare pairwise, calculating the Levenshtein distance for each pair. I understand that this would lead to a matrix of 5*10^5 * 5*10^5 elements.

So far I have tried the following packages on our HPC, but none of them can handle the size of the matrix: levenR, Biostrings::stringDist, stringdist::stringdistmatrix, tidystringdist.

library(tidyverse)

x <- c("T", "A", "C", "G")
data <- expand.grid(rep(list(x), 5)) %>% 
  unite("sequences", 1:5, sep="")

head(data)
#>   sequences
#> 1     TTTTT
#> 2     ATTTT
#> 3     CTTTT
#> 4     GTTTT
#> 5     TATTT
#> 6     AATTT

Created on 2022-02-22 by the reprex package (v2.0.1)

Is there any trick I can use to compute the Levenshtein distances at this scale? Can I parallelise the process and, if so, how? Would it make sense?

I appreciate your time. Any guidance or help is highly appreciated.

LDT
  • Back-of-the-envelope calculation: at 8 bytes per matrix element, storing the full matrix in RAM takes around `5e5 * 5e5 * 8 / 1024^3` GB ≈ 1.8 TB. There's some additional overhead, and because the Levenshtein distance is symmetric you only need to compute the upper or lower triangular matrix, but I am not sure this is tractable (even taking parallelisation and HPC environments into account). I would think about reducing the number of strings first. – Maurits Evers Feb 22 '22 at 22:44
  • Two ideas/questions. First, could your sequences contain duplicates, i.e. several rows with the same sequence? If so, deduplicating could considerably reduce the size of your problem. Second, you could run this in chunks: first calculate the string distance of all sequences against the first 100 values, then against values 101-200, and so on. So basically a for loop. It will still run for quite some time, but might not crash anymore. – deschen Feb 22 '22 at 22:46
  • Dear deschen, that's a good point. However, I really want to compare everything with everything. If I use a window, I'll perhaps miss some comparisons. Maybe I am thinking about this wrongly. – LDT Feb 23 '22 at 07:54
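
The two suggestions in the comments (deduplicate, then work in chunks) can be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested solution for 5*10^5 strings: it uses base R's `adist()` rather than any of the packages listed in the question, it runs on a toy set of 3-letter sequences, and `chunk_size` and `block_dist()` are hypothetical names chosen for the example. Each block compares its rows against all unique sequences, so no pair is skipped.

```r
# Toy data: all 3-letter sequences over T/A/C/G, plus a few duplicates.
x <- c("T", "A", "C", "G")
seqs <- do.call(paste0, expand.grid(rep(list(x), 3), stringsAsFactors = FALSE))
seqs <- c(seqs, seqs[1:10])

# Idea 1: deduplicate, so each distinct string is compared only once.
uniq <- unique(seqs)

# Idea 2: process the rows in chunks (chunk_size is a hypothetical
# value; it would need tuning to the RAM actually available).
chunk_size <- 20
starts <- seq(1, length(uniq), by = chunk_size)

# Hypothetical helper: Levenshtein distances from one block of rows
# to ALL unique sequences, stored as integers (4 bytes, not 8).
block_dist <- function(start) {
  rows <- start:min(start + chunk_size - 1, length(uniq))
  d <- adist(uniq[rows], uniq)
  storage.mode(d) <- "integer"
  d
}

# At full scale each block would be summarised or written to disk
# instead of kept in memory; the blocks are bound together here only
# to check that the chunked result matches the one-shot matrix.
blocks <- lapply(starts, block_dist)
chunked <- do.call(rbind, blocks)
full <- adist(uniq)
all(chunked == full)
```

Because the blocks do not depend on one another, the `lapply()` call is the natural place to parallelise, for example with `parallel::mclapply()` on a Unix-like system. Note that chunking alone does not shrink the total output: the full result is still on the order of the 1.8 TB estimated above (roughly half that with integer storage), so each block has to be reduced or saved before the next one is computed.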
