
I have a large set of DNA sequences over the alphabet {A, C, T, G}: 100,000 lists, each 3,000 characters long. I need to analyse these lists in pairs: compare the 1st list with the 2nd, 3rd, 4th, ..., 100,000th, then compare the 2nd list with the 3rd, 4th, ..., 100,000th, and so on.

In each pairwise comparison, I need to find the indexes at which each unique combination of elements occurs. For example:

List1 = "A", "C", "A", "G", "T", "A", "C", "T", "C".

List2 = "A", "G", "G", "G", "C", "A", "G", "G", "C".

My desired output is:

AA = {1, 6}

CG = {2, 7}

AG = {3}

GG = {4}

TC = {5}

TG = {8}

CC = {9}

I have tried coding this using Rcpp with for loops and if/else statements, but it turns out to be quite slow. Using R functions such as apply, unique, etc. seems to be even slower! I even tried encoding the characters as integers but didn't notice an improvement.

Just wondering if anyone can think of a quicker way to do it...

Thanks!

Sudaraka

1 Answer


Assuming they are actually lists, you could do something like this:

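library(data.table)
# list1 and list2 are the two sequences being compared (equal length)
Df <- data.table(list1, list2)
Df[, .(str = paste0(list1, list2),  # two-letter combination at each position
       row = seq_len(.N))           # position index
   ][, .(idx = paste0(row, collapse = ",")),  # collapse the indexes per combination
     by = str]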

Do this for each pair of lists and then combine the results.
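For a single pair, the same grouping can also be sketched in base R, without data.table. This is only a minimal sketch, not a benchmarked solution: the helper name `pair_indexes` is hypothetical, and it assumes both sequences are character vectors (or lists of single characters) of equal length. It builds the two-letter combination at each position with `paste0` and groups the positions with `split`.

```r
# Hypothetical helper (name is an assumption, not part of the question or answer).
# Returns a named list: one entry per observed combination, holding its positions.
pair_indexes <- function(x, y) {
  stopifnot(length(x) == length(y))
  # paste0 gives the combination at each position; split groups positions by it
  split(seq_along(x), paste0(x, y))
}

# The example from the question
list1 <- c("A", "C", "A", "G", "T", "A", "C", "T", "C")
list2 <- c("A", "G", "G", "G", "C", "A", "G", "G", "C")

res <- pair_indexes(list1, list2)
res[["AA"]]  # positions where list1 has "A" and list2 has "A"
res[["CG"]]  # positions where list1 has "C" and list2 has "G"
```

Note that `split` returns the combinations sorted by name rather than in order of first appearance, and combinations that never occur are simply absent from the result.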

Oliver