I have a problem doing sequence alignment/matching in R for lists. To explain: my data are clickstream data, and I have sequences divided into n-grams. The sequences look something like this:

1. ABDCGHEI... NaNa
2. ACSNa.... NaNa

and so on, where Na stands for "Not available" and is used as padding so that all sequences have the same length. I put all of these sequences into a list in a crude way, like this:

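library(RWeka)  # provides NGramTokenizer and Weka_control

dativec = as.vector(dataseq2)
prova2 = list()  # the list must exist before elements are assigned to it
for(i in seq_along(dativec)) {
  prova2[[i]] = dativec[i]
}
# Split one sequence into bigrams (n-grams with min = max = 2)
BigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
prova3 = lapply(prova2, BigramTokenizer)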

and divided them into n-grams. For example, the bigrams look like this:

[[1]] "A B" "B D" "D C".... "Na Na"
[[2]] "A C" "C S" .... "Na Na"

Now the challenge is: how can I match every bigram of each element of my list against every bigram of the other elements in the list? I tried the Biostrings package, but the function pairwiseAlignment only gives back a score for the first bigram of each element in the list. I just need to know whether the bigrams are identical or not, and I need all comparisons, not just the first elements. The desired result is the percentage of equal sub-n-grams, without any information about positions; I only care about identity. I also tried the setdiff function, but apparently it doesn't work the way I want.

Edited for more clarity

1 Answer

You can use outer:

bigrams <- list (a = c("A B", "B D", "D C", "Na Na"),
                 b = c("A C", "C S", "Na Na"))

with(bigrams, outer(a, b, `==`))

##>       [,1]  [,2]  [,3]
##> [1,] FALSE FALSE FALSE
##> [2,] FALSE FALSE FALSE
##> [3,] FALSE FALSE FALSE
##> [4,] FALSE FALSE  TRUE
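
As a possible next step (a sketch, not part of the original answer), the logical matrix can be reduced to a count or a percentage. The names `eq`, `n_matches` and `share_a_in_b` are made up for illustration, and the choice of denominator (the number of bigrams in `a`) is an assumption:

```r
# Example data copied from the answer above
bigrams <- list(a = c("A B", "B D", "D C", "Na Na"),
                b = c("A C", "C S", "Na Na"))

# Logical matrix: element [i, j] is TRUE when a[i] equals b[j]
eq <- with(bigrams, outer(a, b, `==`))

# Number of matching (a, b) pairs; repeated bigrams are counted once per pair
n_matches <- sum(eq)

# Share of the bigrams in `a` that occur anywhere in `b` (position is ignored)
share_a_in_b <- mean(rowSums(eq) > 0)
```

Note that the only match in this example is the "Na Na" padding bigram, so it may be worth dropping the padding before comparing.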

Stefano Barbi
  • So then I just need to `table` it, and then I'll take the percentage of equal subsequences in every pair, right? – NicodemoXIII Jan 13 '23 at 11:04
  • I think you can use `sum` for that. However, if you just need the number of bigrams in common between two sequences, you'd better start from `%in%` or `intersect`. – Stefano Barbi Jan 13 '23 at 11:06
  • OK, thank you very much, I think it will work like this. In the end I just need to calculate a distance, so the number of bigrams in common is enough. – NicodemoXIII Jan 13 '23 at 11:18
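
The `intersect` suggestion from the comments could be extended to every pair of sequences roughly as follows. This is only a sketch under stated assumptions: `bigram_distance` is a hypothetical helper, the Jaccard distance is one possible choice of distance (the thread does not name one), the third sequence `c` is invented for illustration, and dropping the "Na Na" padding is an assumption about what is wanted:

```r
# Hypothetical helper: Jaccard distance between two bigram vectors,
# ignoring position and (as an assumption) the "Na Na" padding bigram
bigram_distance <- function(x, y, padding = "Na Na") {
  x <- setdiff(unique(x), padding)
  y <- setdiff(unique(y), padding)
  n_union <- length(union(x, y))
  if (n_union == 0) return(0)  # two empty sets are treated as identical
  1 - length(intersect(x, y)) / n_union
}

# `a` and `b` come from the answer; `c` is made up for this example
bigrams <- list(a = c("A B", "B D", "D C", "Na Na"),
                b = c("A C", "C S", "Na Na"),
                c = c("A B", "C S", "Na Na"))

# Compare every sequence with every other one
n <- length(bigrams)
dist_mat <- matrix(0, n, n, dimnames = list(names(bigrams), names(bigrams)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist_mat[i, j] <- bigram_distance(bigrams[[i]], bigrams[[j]])
  }
}
dist_mat
```

A value of 0 means the two sequences share all their distinct bigrams and 1 means they share none. Bigrams that mix a real symbol with padding (for example one ending in "Na") are not filtered here and would need separate handling.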