How to identify repeated subsequences in a dataset

Question

I have a dataset of numerical values, each representing a zone.

For example:

x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)

I need to identify whether there are repeated subsequences within the data, i.e. whether the subject repeatedly travelled from zone 1 to 2 to 3. In the above example, 1,2,3 would give a value of 3. I don't know the subsequences in advance; I need R to find them from the data.

Following that I need to calculate how many times this subsequence appears in the data.

I have very basic knowledge of R, so forgive my ignorance if this is a simple task!

Would something like this work? `library(stringr);table(gsub("_","",unlist(str_extract_all(str_c(x,collapse = "_"),"(\\w{4,})(?=.*\\1)")))) + 1` – Onyambu, Aug 12 '18 at 20:11

IceCreamToucan · Answer 1 · 2018-08-14T12:52:29.660

4

Here's a way to find which sequences of length n repeat, and how many times each one occurs.

For n = 3

library(tidyverse) # not necessary, see base version below

n <- 3
lapply(seq(0, length(x) - n), `+`, seq(n)) %>% # get index of all subsequences
  map_chr(~ paste(x[.], collapse = ',')) %>% # paste together as character
  table %>% # get number of times each occurs
  `[`(. > 1) # select sequences occurring > 1 time
# 1,2,3 
# 3

For n = 2

n <- 2
lapply(seq(0, length(x) - n), `+`, seq(n)) %>% 
  map_chr(~ paste(x[.], collapse = ',')) %>% 
  table %>% 
  `[`(. > 1)
# 1,2 2,3 5,9 
# 3   3   2

Without Tidyverse

seqs <- lapply(seq(0, length(x) - n), `+`, seq(n))
seqs.char <- sapply(seqs, function(i) paste(x[i], collapse = ','))
tbl <- table(seqs.char)
tbl[tbl > 1]
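
Since the question does not fix the subsequence length in advance, the base version can be wrapped in a function and run over several lengths. This is only a minimal sketch: the function name `count_repeats` and the range 2:5 are assumptions chosen for illustration, not part of the original answer.

```r
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)

# count_repeats is a hypothetical helper name, not from the original answer
count_repeats <- function(x, n) {
  # start offsets 0 .. length(x) - n, each shifted by 1..n to give the indices
  seqs <- lapply(seq(0, length(x) - n), `+`, seq(n))
  seqs.char <- sapply(seqs, function(i) paste(x[i], collapse = ','))
  tbl <- table(seqs.char)
  tbl[tbl > 1]  # keep only subsequences seen more than once
}

# try every length from 2 to 5 (an arbitrary range for this example)
res <- lapply(setNames(2:5, paste0("n=", 2:5)), count_repeats, x = x)
res[["n=3"]]
```

Each element of `res` holds the repeated subsequences for one length; lengths with no repeats give an empty result.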

I'll add my own question: does anyone know how to do this without converting to character first? For example, a function `fun` where `fun(list(1:2, 1:2, 2:3))` tells you that 1:2 occurs twice and 2:3 occurs once?
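
One possible way to approach that follow-up, offered as a sketch rather than a definitive solution: build all length-n windows as rows of a numeric matrix with `embed()`, sort the rows, and count runs of identical neighbouring rows. The helper name `count_windows` is an assumption for illustration. It takes the vector and n rather than a list, and it assumes `x` is numeric, has no NAs, and is at least n long.

```r
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)

# count_windows is a hypothetical helper; assumes numeric x without NAs
count_windows <- function(x, n) {
  # embed() returns each window reversed, so flip the columns back
  m <- embed(x, n)[, n:1, drop = FALSE]
  # sort rows lexicographically so identical windows become adjacent
  ord <- do.call(order, lapply(seq_len(n), function(j) m[, j]))
  m <- m[ord, , drop = FALSE]
  # a row starts a new group when it differs from the row above it
  differs <- rowSums(m[-1, , drop = FALSE] != m[-nrow(m), , drop = FALSE]) > 0
  is_new <- c(TRUE, differs)
  # one output row per distinct window, with its number of occurrences
  data.frame(m[is_new, , drop = FALSE], count = tabulate(cumsum(is_new)))
}

res <- count_windows(x, 3)
res[res$count > 1, ]
```

The comparison is done on numbers throughout, so no `paste()` step is needed; the trade-off is that the result is a data frame of columns rather than a named table.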

edited Aug 14 '18 at 12:52

answered Aug 12 '18 at 17:14

IceCreamToucan


Sorry, I really am very new to R! I'm using the version with tidyverse, but I receive the following error: Error in lapply(c("layla", seq(length(x) - n)), `+`, seq(n)) %>% map_chr(~paste(x[.], : could not find function "%>%" – Melanie Aug 14 '18 at 14:17
@Melanie the `%>%` function (called the "pipe") is not part of R by default, but is loaded with the `tidyverse` library (among others). The error is telling you it can't find `%>%` because `tidyverse` isn't loaded. You can run `install.packages('tidyverse')`, then run `library(tidyverse)` to load `tidyverse` before running the code. Another option is to use my other method which gives the same result and doesn't require `tidyverse`. (in the code block where it says "Without Tidyverse") – IceCreamToucan Aug 14 '18 at 14:21

AntoniosK · Answer 2 · 2018-08-12T19:45:15.743

An alternative tidyverse approach that creates a big dataframe of results based on how many values you want your subsequences to have:

library(tidyverse)

# example vector
x <- c(1,6,1,2,3,4,5,8,5,9,10,1,2,3,10,7,5,9,4,1,2,3)

# function that takes as input the number of consecutive elements in a subsequence
# and returns a dataframe ordered by counts of occurrence
f = function(n) {

  data.frame(value = x) %>%               # get the vector x
    slice(1:(nrow(.)-n+1)) %>%            # remove values not needed from the end
    mutate(position = row_number()) %>%   # add position of each value
    rowwise() %>%                         # for each value/row
    mutate(vec = paste0(x[position:(position+n-1)], collapse = ",")) %>% # create subsequences as a string
    ungroup() %>%                         # forget the grouping
    count(vec, sort = TRUE) }             # count each subsequence, most frequent first


2:5 %>%                    # specify how many values in your subsequences you want to investigate (let's say from 2 to 5)
  map_df(~ data.frame(NumElements = ., f(.))) %>%  # apply your function and keep the number values
  arrange(desc(n)) %>%     # order by counts descending
  as_tibble()              # (only for visualisation purposes; replaces the deprecated tbl_df())


# # A tibble: 88 x 3
#   NumElements vec       n
#         <dbl> <chr> <int>
# 1           2 1,2       3
# 2           2 2,3       3
# 3           3 1,2,3     3
# 4           2 5,9       2
# 5           2 1,6       1
# 6           2 10,1      1
# 7           2 10,7      1
# 8           2 3,10      1
# 9           2 3,4       1
# 10          2 4,1       1
# # ... with 78 more rows

lebatsnok · Answer 3 · 2018-08-14T18:41:08.730

The approach below finds repeated sequences of any given length (k): the input vector is converted into a matrix with k columns, filled row by row; this is done k times, each time prepending a different number of NAs (0 to k-1) to the beginning. Finally, all rows in these k matrices are counted (after paste'ing the elements of each row together):

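frs <- function(x, k = 2) {
  # pad the end with NAs up to the next multiple of k
  padit <- function(.) c(., rep(NA, k - length(.) %% k))
  # k shifted copies of x, with 0, 1, ..., k-1 NAs prepended
  xx <- lapply(1:k, function(iii) padit(c(rep(NA, iii - 1), x)))
  # reshape each copy into a k-column matrix and stack the matrices
  xx <- do.call(rbind, lapply(xx, function(.) matrix(., ncol = k, byrow = TRUE)))
  # paste each row into a string such as "1,2,3"
  xx <- sapply(split(xx, 1:NROW(xx)), paste, collapse = ",")
  # keep only the subsequences that occur more than once
  (function(x) x[x > 1])(table(xx))
}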

Output:

> frs(x,2)
xx
1,2 2,3 5,9 
  3   3   2 
> frs(x,3)
1,2,3 
    3 
> frs(x,4)
named integer(0)
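
To see why the shifted copies cover every window, here is a small illustration on a made-up vector `1:7` with k = 3 (the vector and the variable names are chosen for this example only, and are not part of the answer). Each copy cuts the vector into consecutive blocks of k starting at a different offset, so between them every length-k window shows up once as a full row; rows containing NA are partial windows.

```r
v <- 1:7   # toy input, not from the question
k <- 3

# same padding rule as in frs(): append NAs up to the next multiple of k
padit <- function(.) c(., rep(NA, k - length(.) %% k))

# one matrix per shift: 0, 1, ..., k-1 leading NAs, then rows of k values
mats <- lapply(1:k, function(i) {
  matrix(padit(c(rep(NA, i - 1), v)), ncol = k, byrow = TRUE)
})
mats

# rows without NA are exactly the length-k windows of v
all_rows <- do.call(rbind, mats)
full <- all_rows[complete.cases(all_rows), , drop = FALSE]
full[order(full[, 1]), ]
```

With 7 values and k = 3 there are 5 windows, and `full` has 5 rows, one for each starting position.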
