12

Is there a function that searches a sequence of elements for a subsequence? I am looking for an analogue of StringPosition for Lists. In my current application I am working with integer lists, but I'd be interested in a general FindSequence[list, pattern, n] function which will find the first n occurrences of pattern in list.


Here's a toy example:

Generate some data:

In[1]:= $HistoryLength = 0    
Out[1]= 0

In[2]:= Timing[digits = First[RealDigits[\[Pi], 2, 10000000]];]    
Out[2]= {26.5, Null}

Let's convert it to a string so we can compare to StringPosition. This is very slow an memory hungry. (The memory is freed when the evaluation finishes.)

In[3]:= Timing[str = StringJoin[ToString /@ digits];]    
Out[3]= {43.813, Null}

I am looking for this subsequence:

In[4]:= patt = {1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 
   1, 0, 1, 1};

In[5]:= strpatt = StringJoin[ToString /@ patt];

Searching the string is very fast:

In[6]:= StringPosition[str, strpatt] // Timing    
Out[6]= {1.047, {{5737922, 5737943}}}

This is a simple implementation of searching for numerical arrays. It's slower than StringPosition:

In[7]:= Timing[
           corr = ListCorrelate[patt, digits];
           Select[Flatten@Position[corr, patt.patt], 
             digits[[# ;; # + Length[patt] - 1]] === patt &]
        ]

Out[7]= {2.234, {5737922}}

Summary:

  1. Is there a builtin that searches lists for subsequences?
  2. If there isn't, what is a fast and elegant implementation for numeric lists (my practical problem)?
  3. What about generic lists that can contain anything? (There are two possibilities here: "static" patterns only such as {1,0,1}, or general ones like {1,_,1}, though these latter ones may introduce complications.)

I expect this will have many solutions, some fast, some more elegant, some more general :-)


Related questions:

Interesting reading:


EDIT:

I just found the undocumented LongestCommonSubsequencePositions. LongestCommonSubsequencePositions[a, b] will find the longest common subsequence of the lists a and b, and return position of its first occurrence only in both a and b. (The documented LongestCommonSubsequence, which I was not aware of, will only return the subsequence itself, not its position.)

It is slower than the alternatives above, but it works on general lists that can contain any expression.

In[57]:= LongestCommonSubsequencePositions[digits, patt] // Timing
Out[57]= {5.25, {{5737922, 5737943}, {1, 22}}}
Community
  • 1
  • 1
Szabolcs
  • 24,728
  • 9
  • 85
  • 174

3 Answers3

16

You can use ReplaceList with a "prefix" and "suffix" of ___ and match the whole list. This gives you all the replacements that can be made (as opposed to Replace). The position of your pattern is then simply the length of the prefix + 1. It's pretty fast as well:

In[40]:= Timing[ReplaceList[digits, Join[{pre___}, patt, {___}] :> Length[{pre}]
   + 1]]

Out[40]= {1.3059, {5737922}}

Edit: figured it's slightly more elegant to use a delayed rule than to map Length afterwards.

Szabolcs
  • 24,728
  • 9
  • 85
  • 174
Jan Pöschko
  • 5,412
  • 1
  • 28
  • 28
  • 1
    This is a very good idea! But we don't need `ReplaceList`. A faster version is `Replace[digits, {Join[{pre___}, patt, {___}] :> 1 + Length[{pre}], _ -> "not found"}]`. It runs in ~1.65 seconds on my machine, which means that it's faster than my `ListCorrelate` solution *and* it works with list of general expressions. Big **+1**! – Szabolcs Jan 05 '12 at 10:07
  • I am really surprised by the speed of this considering that it seems to unpack the array `digits`. It should *theoretically* also result in a memory spike due to the unpacking but it's too fast to observe on my process monitor, so I don't know if it actually happens. I haven't even considered things like `Replace` for the integer-only problem because I was almost sure that anything pattern based would be quite slow. – Szabolcs Jan 05 '12 at 10:13
  • 1
    AFAIK `Replace` will only give you one matching replacement - if you don't care about more positions, it's fine of course. And by the way, I forgot to add that you can add another parameter to limit the number of matches. – Jan Pöschko Jan 05 '12 at 10:15
  • Yes, you are right! I was focusing too much on getting the first match first :-) I had some doubt about asking this question considering that I already has a working solution for the integer-only problem that I was facing, and that the 2D case was asked before here. But your answer shows that it was definitely worth asking. – Szabolcs Jan 05 '12 at 10:24
  • @JanPöschko We have a proposal for a [mathematica specific site](http://area51.stackexchange.com/proposals/37304/mathematica?referrer=hWeRiD9Qz0oIDM_9uBjtlA2) for anything related to mathematica. It would be nice if you could commit (involves creating an Area51 account) to the proposal. We're pretty close to launching (only need about 50 more users) – abcd Jan 05 '12 at 18:52
  • @JanPöschko The [Mathematica site](http://mathematica.stackexchange.com/) that yoda mentioned has already graduated some times, and all serious Mathematica traffic has moved there from StackOverflow. I thought you might be interested. Sometimes people ask about your program Mathics in the chatroom – Szabolcs Nov 26 '12 at 22:58
  • For a slight speed improvement (~5% in my very limited test) and equal elegance (IMHO), try `Timing[ReplaceList[ digits, {pre___, Sequence @@ patt, ___} :> Length[{pre}] + 1]]` – Shwouchk Apr 16 '16 at 14:50
4

Please have a look at functions seqPos (general lists) and seqposC (integer lists, compiled), which do exactly what you ask for, and are fast. I used them in this answer (for the question you actually linked to).

Here are the timing results for various solutions:

In[15]:= seqPos[digits, patt] // Timing
Out[15]= {1.297, {5737922}}

In[16]:= seqposC[digits, patt] // Timing
Out[16]= {0.125, {5737922}}

In[17]:= 
Timing[corr = ListCorrelate[patt, digits];
      Select[Flatten@Position[corr, patt.patt], 
         digits[[# ;; # + Length[patt] - 1]] === patt &]]

Out[17]= {0.844, {5737922}}

In[18]:= Timing[
    ReplaceList[digits, Join[{pre__}, patt, {___}] :> Length[{pre}] + 1]]

Out[18]= {0.953, {5737922}}

In[19]:= AbsoluteTiming[cf[digits, patt]]
Out[19]= {3.1914063, 5737922}

These indicate that your approach with ListCorrelate is not bad at all. My first function seqPos (it is actually due to Norbert Pozar) is a bit slower but then it is completely general, while seqposC is much faster.

Community
  • 1
  • 1
Leonid Shifrin
  • 22,449
  • 4
  • 68
  • 100
  • You caught me ... I didn't pay attention. Do you also find it surprising that the `ReplaceList` approach does so well? Mathematica performance is so unpredictable ... – Szabolcs Jan 05 '12 at 11:02
  • 1
    @Szabolcs It's understandable. You gave a pretty fast and elegant solution to that question, then I came with an ugly but yet faster one... Yours is still the most elegant for that problem IMO. As for this one, I knew about this solution (with `ReplaceList`) before. So, I had my good deal of surprise with it in the past, when I first encountered `ReplaceList`. It is fast because the pattern is purely syntactic and very simple. It is still surprising that it is *so* fast, but I know of other instances as well where `ReplaceList` is very competitive. A very underrated builtin function, IMO. – Leonid Shifrin Jan 05 '12 at 11:10
2

Here is a compiled version, that avoids the String conversion but is not faster.

cf = Compile[{{in, _Integer, 1}, {patt, _Integer, 1}},
  Block[{lp, res},
   lp = Length[patt];
   res = 0;
   Do[
    If[Total[Abs[in[[i ;; i + lp - 1]] - patt]] == 0,
      res = i; Break[]];
    , {i, 1, Length[in] - lp}];
   res
   ]
  , CompilationTarget -> "C", RuntimeOptions -> "Speed"]


AbsoluteTiming[cf[digits, patt]]