Pairwise sequence analysis - finding indexes of unique combinations

Question

I have a large collection of DNA sequences over the alphabet {A,C,T,G} (100,000 lists in total, each with 3000 characters). I need to analyse these lists in pairs, starting with the 1st list and comparing it with the 2nd, 3rd, 4th, ..., 100,000th, then moving on to the 2nd list and comparing it with the 3rd, 4th, ..., 100,000th, and so on.

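In each pairwise comparison, I need to find the indexes of each unique combination of elements (the element at position i in the first list paired with the element at position i in the second). For example: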

List1 = "A", "C", "A", "G", "T", "A", "C", "T", "C".

List2 = "A", "G", "G", "G", "C", "A", "G", "G", "C".

My desired output is:

AA = {1, 6}

CG = {2, 7}

AG = {3}

GG = {4}

TC = {5}

TG = {8}

CC = {9}

I have tried coding this using Rcpp with for loops and if/else statements, but it turns out to be quite slow. Using R functions such as apply, unique, etc. seems to be even slower! I even tried encoding the characters as integers but didn't notice an improvement.

Just wondering if anyone can think of a quicker way to do it...

Thanks!

What exactly did your attempts look like? What does "quite slow" mean exactly? What are your speed requirements? — MrFlick, Feb 11 '20 at 04:37
Could you also throw in a third list (`List3`) and your desired output? — Edward, Feb 11 '20 at 07:21

Oliver · Answer 1 · 2020-02-11T11:26:25.670

Assuming they are actually lists (or character vectors) of equal length, you could do something like this:

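library(data.table)
# list1 and list2 are the two sequences being compared (equal length)
Df <- data.table(list1 = unlist(list1), list2 = unlist(list2))
# Label each position with its two-letter combination, then collapse
# the positions that share a combination into one comma-separated string
Df[, .(str = paste0(list1, list2),
       row = seq_len(.N))][, .(idx = paste0(row, collapse = ",")),
                           by = str]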

Do this for each pair of lists and then combine the results.
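
If you would rather avoid the data.table dependency, the same grouping can probably be done in base R with split(). The following is only a rough sketch that I have not benchmarked on data of your size: pair_indexes and seqs are names I made up for illustration, and the three-sequence collection is a toy stand-in for your 100,000 lists.

```r
# Hypothetical helper (not part of any package): for one pair of sequences,
# return the positions of each unique two-letter combination.
pair_indexes <- function(x, y) {
  combo <- paste0(unlist(x), unlist(y))  # "AA", "CG", "AG", ...
  # Fix the level order to first appearance so the output follows the data
  split(seq_along(combo), factor(combo, levels = unique(combo)))
}

list1 <- c("A", "C", "A", "G", "T", "A", "C", "T", "C")
list2 <- c("A", "G", "G", "G", "C", "A", "G", "G", "C")

res <- pair_indexes(list1, list2)
res$AA  # 1 6
res$CG  # 2 7

# Toy collection (assumption): three sequences instead of 100,000
seqs <- list(s1 = list1, s2 = list2, s3 = rev(list1))

# All pairs (i, j) with i < j, one column per pair
pairs <- combn(length(seqs), 2)
all_res <- lapply(seq_len(ncol(pairs)), function(k) {
  pair_indexes(seqs[[pairs[1, k]]], seqs[[pairs[2, k]]])
})
names(all_res) <- paste(names(seqs)[pairs[1, ]],
                        names(seqs)[pairs[2, ]], sep = "_")
```

Keep in mind that 100,000 sequences give roughly 5 billion pairs, so materialising every pair with combn() as above will not scale; for the real data you would have to loop over i and j yourself and process or write out each result as you go.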

edited Feb 11 '20 at 11:26

answered Feb 11 '20 at 07:50
