Pairwise sequence list matching in R

Question

I am having trouble doing sequence alignment/matching on lists in R. To explain: my data are clickstream data, and I have sequences divided into n-grams. The sequences look something like this:

1. ABDCGHEI... NaNa
2. ACSNa.... NaNa

and so on, where Na stands for "Not available" and is used as padding so that all sequences have the same length. I put all of these sequences in a list in a crude way, like this:

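library(RWeka)  # provides NGramTokenizer() and Weka_control()

# dataseq2 is the original data object holding the sequences (not shown here)
dativec <- as.vector(dataseq2)

# Copy the sequences into a list, one element per sequence
prova2 <- list()
for (i in seq_along(dativec)) {
  prova2[[i]] <- dativec[i]
}

# Tokenizer that splits a sequence into bigrams (n-grams with n = 2)
BigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
prova3 <- lapply(prova2, BigramTokenizer)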

and divided them into n-grams. For example, the bigrams look like this:

[[1]] "A B" "B D" "D C".... "Na Na"
[[2]] "A C" "C S" .... "Na Na"

Now the challenge is: how can I match every bigram of each element of my list against every bigram of the other elements in the list? I tried the Biostrings package, but the function pairwiseAlignment only returns a score for the first bigram of each element in the list. I just need to know whether bigrams are identical or not, and I need all comparisons, not just the first elements. The desired result is the percentage of equal sub-n-grams, without any information about positions; I only care about identity. I also tried the setdiff function, but it does not seem to work the way I want.

Edited for more clarity

Hello, I think a minimal reproducible example would help (example inputs and expected outputs). – Paul Stafford Allen Jan 13 '23 at 10:26
`"Now i put all of these sequences in a list"` - please share your code. – zx8754 Jan 13 '23 at 10:28
I edited the question to try to be clearer. I did not share the code at first because it only works for this specific case and is not very elegant. – NicodemoXIII Jan 13 '23 at 11:05

score 0 · Accepted Answer · answered Jan 13 '23 at 10:26

You can use outer to compare every bigram of one element with every bigram of another:

bigrams <- list(a = c("A B", "B D", "D C", "Na Na"),
                b = c("A C", "C S", "Na Na"))

with(bigrams, outer(a, b, `==`))

##>       [,1]  [,2]  [,3]
##> [1,] FALSE FALSE FALSE
##> [2,] FALSE FALSE FALSE
##> [3,] FALSE FALSE FALSE
##> [4,] FALSE FALSE  TRUE
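
The call above compares only the elements a and b. For a longer list, the same comparison could be repeated over every pair of elements with combn. The following is only a sketch under stated assumptions: the third element c is hypothetical and added purely for illustration, and counting TRUE cells counts repeated bigrams more than once.

```r
# Example data; element "c" is hypothetical and not part of the original question.
bigrams <- list(a = c("A B", "B D", "D C", "Na Na"),
                b = c("A C", "C S", "Na Na"),
                c = c("A B", "B D", "Na Na"))

# All unordered pairs of element names: a-b, a-c, b-c.
pairs <- combn(names(bigrams), 2, simplify = FALSE)

# For each pair, build the logical matrix with outer() and count its TRUE cells.
n_matches <- sapply(pairs, function(p) {
  sum(outer(bigrams[[p[1]]], bigrams[[p[2]]], `==`))
})
names(n_matches) <- sapply(pairs, paste, collapse = "-")

n_matches
##> a-b a-c b-c
##>   1   3   1
```

Note that the padding bigram "Na Na" is counted as a match here, which is why a and b score 1 even though they share no real bigram.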

Stefano Barbi

So then I just need to `table` it and take the percentage of equal subsequences in every pair, right? – NicodemoXIII Jan 13 '23 at 11:04
I think you can use `sum` for that. However, if you just need the number of bigrams in common between two sequences, you would be better off starting from `%in%` or `intersect`. – Stefano Barbi Jan 13 '23 at 11:06
OK, thank you very much, I think it will work like this. In the end I just need to calculate a distance, so the number of bigrams in common is enough. – NicodemoXIII Jan 13 '23 at 11:18
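
The `intersect` route mentioned in the comments could look roughly like the sketch below. Several things here are assumptions rather than anything stated in the thread: the "percentage" is taken to be the Jaccard similarity (distinct shared bigrams divided by distinct bigrams in either sequence), the "Na Na" padding is dropped before comparing, and the element c and the helper name jaccard are hypothetical.

```r
# Example data; element "c" is hypothetical and added for illustration.
bigrams <- list(a = c("A B", "B D", "D C", "Na Na"),
                b = c("A C", "C S", "Na Na"),
                c = c("A B", "B D", "Na Na"))

# Assumption: padding should not count as a match, so drop it and keep
# each distinct bigram once.
clean <- lapply(bigrams, function(x) unique(x[x != "Na Na"]))

# Hypothetical helper: Jaccard similarity between two sets of bigrams.
jaccard <- function(x, y) {
  u <- union(x, y)
  if (length(u) == 0) return(1)  # both empty: treated as identical (assumption)
  length(intersect(x, y)) / length(u)
}

# Fill a symmetric similarity matrix over all pairs of list elements.
n <- length(clean)
sim <- matrix(1, n, n, dimnames = list(names(clean), names(clean)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim[i, j] <- jaccard(clean[[i]], clean[[j]])
  }
}

# Turn similarity into a distance object usable by e.g. hclust().
dist_mat <- as.dist(1 - sim)
```

With this data, a and c share 2 of 3 distinct bigrams (similarity 2/3), while b shares none with either. A different denominator, such as the length of the shorter sequence, would give a different percentage, so the choice should follow whatever definition of distance is actually needed.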
