Approximate pattern matching in a sequence of integer data and extraction using R

Question

I have an integer pattern, c(1,2,3,4,5), that needs to be approximately matched against a longer data vector, c(1,10,1,6,3,4,5,1,2,3,4,5,9,10,1,2,3,4,6).

I have tried:

pmatch()
all.equal()
grepl()

but they don't seem to support this scenario.

pattern <- c(1,2,3,4,5)

data <- c(1,10,1,6,3,4,5,1,2,3,4,5,9,10,1,2,3,4,6)

For the above example I need to produce the following output:

1,6,3,4,5

1,2,3,4,5

1,2,3,4,6

Appreciate any thoughts on this.

Thanks

How you are getting these outputs is unclear. Please explain what you are doing to go from the inputs to the outputs. — Rich Scriven, Dec 07 '15 at 04:28
@RichardScriven - it's terribly unclear, but it seems to be matching in sets, i.e. - remove the first batch of closest matches, then start again. `1:5` matches `1,6,3,4,5` pretty closely, then `1,2,3,4,5`, then `1,2,3,4,6` — thelatemail, Dec 07 '15 at 04:33
Like an approximate version of this: http://stackoverflow.com/questions/33027611/how-to-index-a-vector-sequence-within-a-vector-sequence/33028695 — thelatemail, Dec 07 '15 at 04:40
How do you want to handle overlapping sequences, for example: `c(1,2,3,4,1,2,3,4,5)` — Gary Weissman, Dec 07 '15 at 14:02

score 2 · Answer 1 · answered Dec 07 '15 at 14:01

I think you are saying "match a sequence of integers in another sequence of integers where at least N-1 of the integers match". It's unclear what the behavior should be in the case of overlapping matches, so the following will pick up sequences that do overlap.

# helper function to test "match" at a threshold of 4 matches
is_almost <- function(s1, s2, thresh = 4) {
   sum(s1 == s2) >= thresh
}

# function to look up and return matching sequences
extract_seq <- function(pattern, data) {
   n <- length(pattern)
   # seq_len() yields no windows when data is shorter than the pattern
   res <- lapply(seq_len(max(0, length(data) - n + 1)), function(s) {
      subseq <- data[s:(s + n - 1)]
      # return the window on a match; otherwise NULL, dropped by Filter() below
      if (is_almost(pattern, subseq)) subseq
   })
   Filter(Negate(is.null), res)
}

# let's test it out
pattern <- c(1,2,3,4,5)
data <- c(1,10,1,6,3,4,5,1,2,3,4,5,9,10,1,2,3,4,6)

extract_seq(pattern,data)

[[1]]
[1] 1 6 3 4 5

[[2]]
[1] 1 2 3 4 5

[[3]]
[1] 1 2 3 4 6
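If overlapping matches are not wanted, one possible reading of the "match, then start again" behaviour described in the comments is a greedy left-to-right scan that jumps past each accepted window. The following is only a sketch under that assumption; extract_seq_greedy is a hypothetical name, not part of the code above, and it assumes data without NA values.

```r
# Hypothetical helper: greedy, non-overlapping approximate matching.
# Assumes integer-like data with no NA values.
extract_seq_greedy <- function(pattern, data, thresh = length(pattern) - 1) {
  n <- length(pattern)
  out <- list()
  s <- 1
  while (s + n - 1 <= length(data)) {
    window <- data[s:(s + n - 1)]
    if (sum(window == pattern) >= thresh) {
      out[[length(out) + 1]] <- window
      s <- s + n   # jump past the accepted window so matches cannot overlap
    } else {
      s <- s + 1   # otherwise slide forward by one element
    }
  }
  out
}

pattern <- c(1, 2, 3, 4, 5)
data <- c(1, 10, 1, 6, 3, 4, 5, 1, 2, 3, 4, 5, 9, 10, 1, 2, 3, 4, 6)
extract_seq_greedy(pattern, data)

# The overlapping case raised in the comments
extract_seq_greedy(pattern, c(1, 2, 3, 4, 1, 2, 3, 4, 5))
```

On the question's data this returns the same three sequences as extract_seq(). On c(1,2,3,4,1,2,3,4,5) the greedy scan accepts 1 2 3 4 1 first and therefore never reaches the exact match starting at position 5, so whether greedy is the right rule depends on how overlaps should be resolved.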

I tried Gary's solution with a numeric vector (data) of size 500 thousand. It took just 6 seconds to produce the results. — Nasir, Dec 18 '15 at 05:40
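For long vectors, the per-window function call could in principle be replaced by a single matrix comparison. The sketch below uses base R's embed() to build all windows at once and returns the start positions of windows with at least thresh agreeing elements. match_starts is a hypothetical name, nothing here was benchmarked, and the window matrix needs roughly length(data) * length(pattern) elements of memory.

```r
# Hypothetical vectorised variant: start positions of approximate matches.
match_starts <- function(pattern, data, thresh = length(pattern) - 1) {
  n <- length(pattern)
  if (length(data) < n) return(integer(0))
  # embed() stores each window in reverse order, so flip the columns;
  # row i of 'windows' is then data[i:(i + n - 1)]
  windows <- embed(data, n)[, n:1, drop = FALSE]
  # rep(..., each = nrow) lines pattern[j] up with column j of every window
  hits <- rowSums(windows == rep(pattern, each = nrow(windows))) >= thresh
  which(hits)
}

pattern <- c(1, 2, 3, 4, 5)
data <- c(1, 10, 1, 6, 3, 4, 5, 1, 2, 3, 4, 5, 9, 10, 1, 2, 3, 4, 6)
starts <- match_starts(pattern, data)
starts
# [1]  3  8 15

# Recover the matched sequences from the start positions
lapply(starts, function(s) data[s:(s + length(pattern) - 1)])
```

Like extract_seq(), this reports overlapping windows when they occur.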

score 0 · Answer 2 · answered Dec 10 '15 at 21:56

If you want to find the unique elements in a vector that match a given vector, you can use %in% to test for the presence of your 'pattern' elements within the larger vector. The %in% operator returns a logical vector. Passing that output to which() returns the index of each TRUE value, which can be used to subset the larger vector and return all of the elements that match the 'pattern', regardless of order. Passing the subset vector to unique() eliminates duplicates, so that each matching element of the larger vector occurs only once.

For example:

> num.data <- c(1, 10, 1, 6, 3, 4, 5, 1, 2, 3, 4, 5, 9, 10, 1, 2, 3, 4, 5, 6)
> num.pattern.1 <- c(1,6,3,4,5)
> num.pattern.2 <- c(1,2,3,4,5)
> num.pattern.3 <- c(1,2,3,4,6)
> unique(num.data[which(num.data %in% num.pattern.1)])
[1] 1 6 3 4 5
> unique(num.data[which(num.data %in% num.pattern.2)])
[1] 1 3 4 5 2
> unique(num.data[which(num.data %in% num.pattern.3)])
[1] 1 6 3 4 2

Notice that the first result matches the order of num.pattern.1 by coincidence. The other two vectors do not match the order of the pattern vectors.

To find the exact sequence within num.data that matches the patterns you can use something similar to the following function:

set.seed(12102015)
test.data <- sample(c(1:99), size = 500, replace = TRUE)
test.pattern.1 <- test.data[90:94]

find_vector <- function(test.data, test.pattern.1) {
   # List of all the vectors from test.data with length = length(test.pattern.1), currently empty
   lst <- vector(mode = "list")
   # List of vectors that meet condition 1, currently empty
   lst2 <- vector(mode = "list")
   # List of vectors that meet condition 2, currently empty
   lst3 <- vector(mode = "list")
   # A modifier to the iteration variable used to build 'lst'
   a <- length(test.pattern.1) - 1
   # The loop to iterate through 'test.data' testing for conditions and building lists to return a match.
   # It stops at the last full-length window, so no window is padded with NA.
   for(i in seq_len(max(0, length(test.data) - a))) {
     # The list is built incrementally as 'i' increases
     lst[[i]] <- test.data[c(i:(i+a))]
     # Condition 1
     if(sum(lst[[i]] %in% test.pattern.1) == length(test.pattern.1)) {lst2[[i]] <- lst[[i]]}
     # Condition 2
     if(identical(lst[[i]], test.pattern.1)) {lst3[[i]] <- lst[[i]]}
   }
   # Remove nulls from 'lst2' and 'lst3' (vapply() also copes with an empty list)
   lst2 <- lst2[!vapply(lst2, is.null, logical(1))]
   lst3 <- lst3[!vapply(lst3, is.null, logical(1))]
   # Return the intersection of 'lst2' and 'lst3' which should be a match to the pattern vector.
   return(intersect(lst2, lst3))
}

For reproducibility I used set.seed() and then created a test data set and pattern. The function find_vector() takes two arguments: first, test.data that is the larger numerical vector you wish to check for pattern vectors and second, test.pattern.1 that is the shorter numerical vector you wish to find in test.data. First, three lists are created: lst to hold test.data divided into smaller vectors of length equal to the length of the pattern vector, lst2 to hold the pattern vectors from lst that satisfy the first condition, and lst3 to hold from lst the vectors that satisfy the second condition. The first condition tests that the elements of the vectors in lst are in the pattern vector. The second condition tests that the vector from lst matches the pattern vector by order and by element.

One quirk of this approach is that assigning by index (lst2[[i]] and lst3[[i]]) leaves a NULL at every position where the condition was not satisfied, up to the last position where it was. For reference you may print the lists to see all the vectors tested, the vectors that meet the first condition, and the vectors that meet the second condition. Once the nulls are removed, finding the intersection of lst2 and lst3 will reveal the pattern matched identically in test.data.

To use the function, make sure to explicitly define test.data <- 'a numeric vector' and test.pattern.1 <- 'a numeric vector'. No special packages are needed. I didn't do any benchmarking, but the function appears to run quickly. I also did not look for scenarios where the function would fail.
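Since any window that passes the second condition (identical()) necessarily passes the first, the same exact matches can probably be found with a single check per window. A minimal sketch, assuming only the start positions are needed; find_exact is a hypothetical name and not part of the function above.

```r
# Hypothetical single-pass alternative: start positions of exact matches.
# Note: identical() is type-sensitive, so an integer pattern such as 1:3
# will not match double data such as c(1, 2, 3).
find_exact <- function(data, pattern) {
  n <- length(pattern)
  starts <- seq_len(max(0, length(data) - n + 1))
  Filter(function(s) identical(data[s:(s + n - 1)], pattern), starts)
}

set.seed(12102015)
test.data <- sample(1:99, size = 500, replace = TRUE)
test.pattern.1 <- test.data[90:94]

# The result includes 90, the position the pattern was taken from
find_exact(test.data, test.pattern.1)
```

Returning positions rather than the matched vectors makes repeated occurrences visible, which intersect() would collapse into one.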
