Imagine that we have a list like
[255,7,0,0,255,7,0,0,255,7,0,0,255,7,0,0]
We want to find the shortest common subsequence (NOT SUBSTRING) that contains all the items in the subsequence which will be 255,7,0,0
in this case but we don't know the length of the pattern.
The program should work even if there are some gibberish in the middle like this sequence. 255,7,0,0,4,3,255,5,6,7,0,0,255,7,0,0,255,7,0,1,2,0
, it should return the repeating subsequence which will be 255,7,0,0
.
I tried longest common subsequence but since the algorithm is greedy, it does not work for this case, since it will return all the matches not the shortest one. Your help is highly appreciated.
import numpy as np
cimport numpy as np
from libc.stdlib cimport *
from clcs cimport *
np.import_array()
def lcs_std(x, y):
"""Standard Longest Common Subsequence (LCS)
algorithm as described in [Cormen01]_.Davide Albanese
The elements of sequences must be coded as integers.
:Parameters:
x : 1d integer array_like object (N)
first sequence
y : 1d integer array_like object (M)
second sequence
:Returns:
length : integer
length of the LCS of x and y
path : tuple of two 1d numpy array (path_x, path_y)
path of the LCS
"""
cdef np.ndarray[np.int_t, ndim=1] x_arr
cdef np.ndarray[np.int_t, ndim=1] y_arr
cdef np.ndarray[np.int_t, ndim=1] px_arr
cdef np.ndarray[np.int_t, ndim=1] py_arr
cdef char **b
cdef int i
cdef Path p
cdef int length
x_arr = np.ascontiguousarray(x, dtype=np.int)
y_arr = np.ascontiguousarray(y, dtype=np.int)
b = <char **> malloc ((x_arr.shape[0]+1) * sizeof(char *))
for i in range(x_arr.shape[0]+1):
b[i] = <char *> malloc ((y_arr.shape[0]+1) * sizeof(char))
length = std(<long *> x_arr.data, <long *> y_arr.data, b,
<int> x_arr.shape[0], <int> y_arr.shape[0])
trace(b, <int> x_arr.shape[0], <int> y_arr.shape[0], &p)
for i in range(x_arr.shape[0]+1):
free (b[i])
free(b)
px_arr = np.empty(p.k, dtype=np.int)
py_arr = np.empty(p.k, dtype=np.int)
for i in range(p.k):
px_arr[i] = p.px[i]
py_arr[i] = p.py[i]
free (p.px)
free (p.py)
return length, (px_arr, py_arr)