Imagine we have a list like [255,7,0,0,255,7,0,0,255,7,0,0,255,7,0,0]. We want to find the shortest common subsequence (NOT SUBSTRING) that contains all the items of the repeating pattern, which is 255,7,0,0 in this case, but we don't know the length of the pattern in advance.

The program should work even if there is some gibberish in the middle, as in this sequence: 255,7,0,0,4,3,255,5,6,7,0,0,255,7,0,0,255,7,0,1,2,0. It should return the repeating subsequence, which is 255,7,0,0.

I tried longest common subsequence, but since the algorithm is greedy it does not work for this case: it returns all the matches rather than the shortest one. Your help is highly appreciated.

import numpy as np
cimport numpy as np
from libc.stdlib cimport *
from clcs cimport *

np.import_array()


def lcs_std(x, y):
    """Standard Longest Common Subsequence (LCS)
    algorithm as described in [Cormen01]_. Davide Albanese
    The elements of sequences must be coded as integers.

    :Parameters:
       x : 1d integer array_like object (N)
          first sequence
       y : 1d integer array_like object (M)
          second sequence
    :Returns:
       length : integer
          length of the LCS of x and y
       path : tuple of two 1d numpy array (path_x, path_y)
          path of the LCS
    """

    cdef np.ndarray[np.int_t, ndim=1] x_arr
    cdef np.ndarray[np.int_t, ndim=1] y_arr
    cdef np.ndarray[np.int_t, ndim=1] px_arr
    cdef np.ndarray[np.int_t, ndim=1] py_arr
    cdef char **b
    cdef int i
    cdef Path p
    cdef int length

    # np.int_ matches the np.int_t buffer type declared above
    # (the plain np.int alias was removed in NumPy 1.24).
    x_arr = np.ascontiguousarray(x, dtype=np.int_)
    y_arr = np.ascontiguousarray(y, dtype=np.int_)

    # Allocate the (N+1) x (M+1) direction table used for the traceback.
    b = <char **> malloc((x_arr.shape[0] + 1) * sizeof(char *))
    for i in range(x_arr.shape[0] + 1):
        b[i] = <char *> malloc((y_arr.shape[0] + 1) * sizeof(char))

    # Fill the table and get the LCS length.
    length = std(<long *> x_arr.data, <long *> y_arr.data, b,
                 <int> x_arr.shape[0], <int> y_arr.shape[0])

    # Walk the table backwards to recover the matched index pairs.
    trace(b, <int> x_arr.shape[0], <int> y_arr.shape[0], &p)

    for i in range(x_arr.shape[0] + 1):
        free(b[i])
    free(b)

    # Copy the C path into NumPy arrays before releasing it.
    px_arr = np.empty(p.k, dtype=np.int_)
    py_arr = np.empty(p.k, dtype=np.int_)

    for i in range(p.k):
        px_arr[i] = p.px[i]
        py_arr[i] = p.py[i]

    free(p.px)
    free(p.py)

    return length, (px_arr, py_arr)
  • Edit your question to include the code you tried please. – pzp May 07 '15 at 21:31
  • The example in your second paragraph with "gibberish" will not identify 255,7,0,0 because this sequence will not contain all items of the sequence (a requirement from the first paragraph). – styts May 07 '15 at 21:32
  • It should still contain all items of repeating sequence – user2628665 May 07 '15 at 21:41
  • When you say "shortest common subsequence (NOT SUBSTRING)", what is the distinction you're trying to make? And the next phrase, "contains all the items in the subsequence", makes it sound like you're talking about a subset (ignoring order), not a subsequence. So… can you define exactly what you're looking for? – abarnert May 07 '15 at 21:46
  • Also, if `[255, 7, 0, 0]` is in a "subsequence" in your sense of `[255, 5, 6, 7, 0]`, why isn't `[255]` also a "subsequence", and an obviously shorter one? – abarnert May 07 '15 at 21:48
  • Finally, "I tried longest common subsequence"… Why did you try that? Getting the longest common subsequence obviously isn't going to do the right thing if you want the shortest common subsequence. If you can explain the thought process that made you think LCS would be helpful, it might help us understand the problem. (For comparison, imagine you wanted to find `min`, and you knew how to find `max`. You might try to negate all the numbers, find the `max`, then negate it, but get stuck somewhere on the way. Did you think you were on the track for a similar solution to get SCS from LCS?) – abarnert May 07 '15 at 21:50
  • @abarnert yeah, in every one of OP's examples, what OP is calling the shortest common subsequence is actually the longest one. – abcd May 07 '15 at 21:51
  • @dbliss Isn't the longest one the entire sequence? Or at least half the sequence, if with "common" he means "repeated"? – Stefan Pochmann May 07 '15 at 22:03
  • @StefanPochmann ah, good call, the longest would be half the sequence (for the first example only), assuming "common" means "repeated." so the OP is picking out neither the longest nor the shortest. (for the gibberish example he's picking out the longest one.) – abcd May 07 '15 at 22:05
  • @dbliss No, the gibberish doesn't change it, still entire/half sequence. [Substring](http://en.wikipedia.org/wiki/Substring) means consecutive elements, [subsequence](http://en.wikipedia.org/wiki/Subsequence) just means elements in order, you can skip. And he specifically said subsequence, not substring. – Stefan Pochmann May 07 '15 at 22:09
  • @StefanPochmann ok, that actually clarifies things a lot. we're not working with the `python` definitions for `string` and sequence, but some other definitions. – abcd May 07 '15 at 22:14
  • @dbliss Yeah, at least that's what I assume. Btw, do you see his desired output for the gibberish example? I just see the sentence end abruptly, right before I'd expect the output... – Stefan Pochmann May 07 '15 at 22:14
  • @StefanPochmann naa, it isn't there. i think he means `255, 7, 0, 0`, though, given the first paragraph. – abcd May 07 '15 at 22:16
  • I am not sure whether it should be called the short or the long subsequence, but the goal is to find the repeating subsequence that contains all the items that are being repeated, not the whole sequence. Obviously, the LCS will return the whole sequence in the first case and the repeating sequence minus the gibberish in the second case, but it is still greedy. Would someone also explain why I got 6 minus points on my question? It has not been asked before and it is a genuine question. Thanks for your help. – user2628665 May 07 '15 at 22:25
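The substring/subsequence distinction drawn in the comments above can be sketched in a few lines of Python. The helper names `is_substring` and `is_subsequence` are illustrative only and do not come from the question's code.

```python
def is_substring(pattern, seq):
    """True if `pattern` appears in `seq` as consecutive elements."""
    n, m = len(seq), len(pattern)
    return any(list(seq[i:i + m]) == list(pattern) for i in range(n - m + 1))


def is_subsequence(pattern, seq):
    """True if `pattern` appears in `seq` in order, possibly with gaps."""
    it = iter(seq)
    # `item in it` advances the iterator, so later items can only be
    # matched after the position where earlier items were found.
    return all(item in it for item in pattern)


noisy = [255, 5, 6, 7, 0, 0]
print(is_substring([255, 7, 0, 0], noisy))    # the 5, 6 break the run
print(is_subsequence([255, 7, 0, 0], noisy))  # gaps are allowed
```

Under these definitions `[255]` is also a subsequence of the noisy list, which is why "shortest" needs an extra constraint (such as a minimum number of repetitions) to be meaningful.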

1 Answer

Have a look at sequential pattern mining.

You seem to have reinvented frequent itemsets in sequences; I think there are a dozen or so algorithms for that.

Han, J.; Cheng, H.; Xin, D.; Yan, X. (2007). "Frequent pattern mining: current status and future directions". Data Mining and Knowledge Discovery 15 (1): 55–86.
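As a rough illustration of the idea, here is a minimal level-wise (Apriori-style) sketch in Python. It is not one of the published algorithms from the survey above. It assumes the caller supplies a minimum number of repetitions (`min_support`), and it counts occurrences greedily as non-overlapping subsequences with arbitrary gaps. The names `count_occurrences` and `frequent_patterns` are hypothetical.

```python
def count_occurrences(seq, pattern):
    """Greedily count non-overlapping occurrences of `pattern` in `seq`,
    allowing arbitrary gaps between matched items."""
    count = 0
    j = 0  # index of the next pattern item we are waiting for
    for item in seq:
        if item == pattern[j]:
            j += 1
            if j == len(pattern):  # one full occurrence found; start over
                count += 1
                j = 0
    return count


def frequent_patterns(seq, min_support):
    """Grow patterns one item at a time, keeping only those that still
    occur at least `min_support` times; return the longest survivors."""
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    # Level 1: single items that are frequent on their own.
    level = [(x,) for x in sorted(set(seq))
             if count_occurrences(seq, (x,)) >= min_support]
    frequent_items = [p[0] for p in level]
    longest = level
    while level:
        longest = level
        # Extend every surviving pattern by every frequent item.
        level = [p + (x,) for p in level for x in frequent_items
                 if count_occurrences(seq, p + (x,)) >= min_support]
    return longest


clean = [255, 7, 0, 0] * 4
noisy = [255, 7, 0, 0, 4, 3, 255, 5, 6, 7, 0, 0,
         255, 7, 0, 0, 255, 7, 0, 1, 2, 0]
print(frequent_patterns(clean, 4))              # [(255, 7, 0, 0)]
print(count_occurrences(noisy, (255, 7, 0, 0)))  # 4
```

The sketch enumerates candidates exhaustively, so it is only practical for short sequences and small alphabets; dedicated sequential pattern mining algorithms prune the search far more aggressively.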

Has QUIT--Anony-Mousse