R - how to find longest duplicate sequences and their frequencies

Question

I have some data that looks like this:

29  32  33  46  47  48
29  34  35  39  40  43
29  35  36  38  41  43
30  31  32  34  36  49
30  32  35  40  43  44
39  40  43  46  47  50
 7  8    9  39  40  43
 1  7    8  12  40  43

There is actually a lot more data, but I wanted to keep this short. I'd like to find the longest common subsequences across all rows in R and sort them by frequency (decreasing), reporting only those common subsequences that have more than one element and occur more than once. Is there a way to do this in R?

So example result would be something like:

[29] 3
[30] 2 
...
( etc for all the single duplicates across each row and their frequencies )
...
[46  47] 2
[39  40  43] 3
[40  43] 2

What exactly do you mean by "longest common subsequence for all rows"? — PejoPhylo, Sep 14 '17 at 17:20
@Nena It wasn't super clear what you were asking. Could you check whether the output of my answer is consistent with what you wanted? — CPak, Sep 14 '17 at 22:44
Longest common subsequences across all rows means, as shown in the example, all of the numbers in common between rows and the number of times each combination is repeated across all rows. Assume the rows are sorted in increasing order. Does that make sense? For example, [39, 40, 43] is repeated 3 times. [39, 40] is also repeated, but since [39, 40, 43] is the longest combination, take that one. Hope that makes sense — Nena, Sep 20 '17 at 17:37

score 0 · Answer 1 · answered Sep 14 '17 at 22:23

It seems you are asking two different kinds of questions. You want 1) the lengths of contiguous runs of a single value within each column, and 2) counts of ngrams that are built rowwise but counted columnwise, where the matching rows do not have to be adjacent.

library(tidyverse)
# single number contiguous runs by column (df is defined under "Your data" below)
single <- Reduce("rbind", apply(df, 2, function(x) {
    r <- rle(x)  # run-length encode the column once
    tibble(val=r$values, occurrence=r$lengths) %>% filter(occurrence>1)
}))

Output of single

    val occurrence
  <int>      <int>
1    29          3
2    30          2
3    40          2
4    43          2
5    43          2
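
To illustrate where these counts come from, here is a small sketch of what `rle()` returns for the first column of the example data. The vector `col1` is typed in by hand for illustration and is not part of the original answer.

```r
# First column of the example data, copied by hand (illustrative only)
col1 <- c(29, 29, 29, 30, 30, 39, 7, 1)

# rle() run-length encodes the vector: one entry per contiguous run
runs <- rle(col1)
runs$values   # 29 30 39  7  1
runs$lengths  #  3  2  1  1  1

# Keep only the values whose run spans more than one row
repeated <- runs$values[runs$lengths > 1]
repeated      # 29 30
```

Because `rle()` only sees adjacent equal values, a value that repeats in non-adjacent rows of a column would be reported as separate runs.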

# ngram numbers by row (count, non-contiguous)
restof <- Reduce("rbind", lapply(1:(ncol(df)-1), function(z) {
    # for every row and every start column y, paste the z+1 consecutive values x[y..(y+z)]
    nruns <- t(apply(df, 1, function(x) sapply(head(seq_along(x),-z), function(y) paste(x[y:(y+z)], collapse=" "))) )
    # within each start column, count identical ngrams and keep those seen more than once
    Reduce("rbind", apply(nruns, 2, function(x) tibble(val=names(table(x)), occurrence=c(table(x))) %>% filter(occurrence>1)))
}))

Output of restof

       val occurrence
     <chr>      <int>
1    39 40          2
2    46 47          2
3    40 43          3
4 39 40 43          2

Combining the data

ans <- rbind(single, restof)

Output

       val occurrence
     <chr>      <int>
1       29          3
2       30          2
3       40          2
4       43          2
5       43          2
6    39 40          2
7    46 47          2
8    40 43          3
9 39 40 43          2

Your data

df <- read.table(text="29  32  33  46  47  48
29  34  35  39  40  43
29  35  36  38  41  43
30  31  32  34  36  49
30  32  35  40  43  44
39  40  43  46  47  50
 7  8    9  39  40  43
 1  7    8  12  40  43")

I see `39 40 43` 3 times in the source though. Is it expected to show 2 in the output? (I know the OP displayed 2 in the expected output, but I assumed that was a mistake.) — moodymudskipper, Sep 14 '17 at 22:34
`40 43` occurs 2 times without `39` in the data, and the OP's expected output also gives 2, but you return 3 — moodymudskipper, Sep 14 '17 at 22:36
Yeah, it's a weird situation. You still need to count columnwise even for ngrams. That is, for the trigram `39 40 43`, how many are in a column (indexing by first element?)...anyway, it was consistent with the OP's output even if the OP's description wasn't super clear. — CPak, Sep 14 '17 at 22:37
See my second comment, it's not all consistent. As I read it there is nothing to do columnwise, but not super clear as you say — moodymudskipper, Sep 14 '17 at 22:44
Yes, I see your point. Further clarification is required. Thanks for pointing out the discrepancy. — CPak, Sep 14 '17 at 22:48
Yes, sorry, I updated it. `39 40 43` occurs 3 times; I've updated the OP — Nena, Sep 20 '17 at 17:35
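
Under the position-independent reading discussed in these comments (a run counts wherever it appears in a row, not only when it starts in the same column), the counting could be sketched roughly as follows. This is one possible interpretation, not the answer's method: the helper names `row_ngrams`, `count_ngrams` and `drop_subsumed` are hypothetical, and "longest wins" is approximated by dropping a run only when a longer run containing it occurs equally often.

```r
# Example data from the question
df <- read.table(text = "29 32 33 46 47 48
29 34 35 39 40 43
29 35 36 38 41 43
30 31 32 34 36 49
30 32 35 40 43 44
39 40 43 46 47 50
7 8 9 39 40 43
1 7 8 12 40 43")

# Hypothetical helper: all contiguous runs of length n in one row, as strings
row_ngrams <- function(x, n) {
  starts <- seq_len(length(x) - n + 1)
  vapply(starts, function(i) paste(x[i:(i + n - 1)], collapse = " "), character(1))
}

# Hypothetical helper: for every run length, count how many rows contain each run
count_ngrams <- function(df) {
  m <- as.matrix(df)
  out <- lapply(seq_len(ncol(m)), function(n) {
    # unique() so that a run is counted at most once per row
    grams <- unlist(lapply(seq_len(nrow(m)), function(r) unique(row_ngrams(m[r, ], n))))
    tab <- table(grams)
    data.frame(val = names(tab), len = n, occurrence = as.integer(tab),
               stringsAsFactors = FALSE)
  })
  res <- do.call(rbind, out)
  res <- res[res$occurrence > 1, ]            # keep runs seen in more than one row
  res[order(-res$occurrence, -res$len), ]     # most frequent first, longer runs first on ties
}

# Hypothetical helper: drop a run if a longer run contains it and occurs equally often
drop_subsumed <- function(res) {
  pad <- function(s) paste0(" ", s, " ")      # pad so "9" does not match inside "39"
  keep <- vapply(seq_len(nrow(res)), function(i) {
    longer <- res$val[res$len > res$len[i] & res$occurrence == res$occurrence[i]]
    length(longer) == 0 || !any(grepl(pad(res$val[i]), pad(longer), fixed = TRUE))
  }, logical(1))
  res[keep, ]
}

res <- count_ngrams(df)
closed <- drop_subsumed(res)
```

On the example data this counts `39 40 43` in 3 rows and `46 47` in 2 rows, and `drop_subsumed()` removes `39 40` because it never occurs outside `39 40 43`. Note that `40 43` is reported with 5 rows rather than the 2 in the question's expected output, since this sketch does not subtract the occurrences that fall inside the longer run `39 40 43`.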
