Why does this Rascal pattern matching code use so much memory and time?

Question

I'm trying to write what I would think of as an extremely simple piece of code in Rascal: testing whether list A contains list B.

I started out with some very basic code to create a list of strings:

public list[str] makeStringList(int Start, int End)
{
    return [ "some string with number <i>" | i <- [Start..End]];
}

public list[str] toTest = makeStringList(0, 200000);

My first try was 'inspired' by the sorting example in the tutor:

public void findClone(list[str] In,  str S1, str S2, str S3, str S4, str S5, str S6)
{
    switch(In)
    {
        case [*str head, str i1, str i2, str i3, str i4, str i5, str i6, *str tail]:   
        {
            if(S1 == i1 && S2 == i2 && S3 == i3 && S4 == i4 && S5 == i5 && S6 == i6)
            {
                println("found duplicate\n\t<i1>\n\t<i2>\n\t<i3>\n\t<i4>\n\t<i5>\n\t<i6>");
            }
            fail;
        }
        default:
            return;
    }
}

Not very pretty, but I expected it to work. Unfortunately, the code runs for about 30 seconds before crashing with an "out of memory" error.

I then tried a better looking alternative:

public void findClone2(list[str] In, list[str] whatWeSearchFor)
{
    for ([*str head, *str mid, *str end] := In)
    if (mid == whatWeSearchFor)
        println("gotcha");
}

This gave approximately the same result (it seems to run a little longer before running out of memory).

Finally, I tried a 'good old' C-style approach with a for-loop:

public void findClone3(list[str] In, list[str] whatWeSearchFor)
{
    cloneLength = size(whatWeSearchFor);
    inputLength = size(In);

    if(inputLength < cloneLength) return;

    loopLength = inputLength - cloneLength + 1;

    for(int i <- [0..loopLength])
    {
        isAClone = true;
        for(int j <- [0..cloneLength])
        {
            if(In[i+j] != whatWeSearchFor[j])
                isAClone = false;
        }

        if(isAClone) println("Found clone <whatWeSearchFor> on lines <i> through <i+cloneLength-1>");   
    }
}

To my surprise, this one works like a charm. No out of memory, and results in seconds.

I get that my first two attempts probably create a lot of temporary string objects that all have to be garbage collected, but I can't believe that the only solution that worked really is the best solution.

Any pointers would be greatly appreciated.

My relevant eclipse.ini settings are:

-XX:MaxPermSize=512m
-Xms512m
-Xss64m
-Xmx1G

score 1 · Answer 1 · answered Oct 30 '15 at 23:59

We'll need to look to see why this is happening. Note that, if you want to use pattern matching, this is maybe a better way to write it:

public void findClone(list[str] In,  str S1, str S2, str S3, str S4, str S5, str S6) {
    switch(In) {
        case [*str head, S1, S2, S3, S4, S5, S6, *str tail]: {
            println("found duplicate\n\t<S1>\n\t<S2>\n\t<S3>\n\t<S4>\n\t<S5>\n\t<S6>"); 
        } 
        default: 
            return; 
    } 
}

If you do this, you take advantage of Rascal's matcher to find the matching strings directly. In your first example any six strings would match, and you then needed a number of separate comparisons to check whether the match was the combination you were looking for. If I run this searching for strings 110145 through 110150, it takes a while but works, and it doesn't seem to grow beyond the heap space you allocated to it.

Also, is there a reason you are using fail? Is this to continue searching?

Thank you very much for your answer, Mark. Yes, the purpose of `fail` is to continue searching, as I want to find all occurrences of the 'sublist' in the list. Your solution does seem to work, as it does not throw an 'out of memory' error (even when I added the `fail` to force it to search through the whole list), but it is still a lot slower than the loop-based one. This surprised me, as I am usually told not to write in that style in modern programming languages (I am a C++ programmer by profession). — Bouke, Oct 31 '15 at 10:22

score 0 · Answer 2 · answered Oct 31 '15 at 12:41

It's an algorithmic issue, like Mark Hills said. In Rascal, some short code can still entail a lot of nested loops, almost implicitly. Basically, every * splice operator on a fresh variable that you use on the pattern side in a list generates one level of loop nesting, except for the last one, which is just the rest of the list.
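To put a number on that, here is a small counting sketch in Python (not Rascal, so nothing is hidden behind the pattern syntax). The function name `candidate_splits` is made up for illustration, and it assumes the matcher may try every way of cutting the list into consecutive, possibly empty, parts, one part per fresh * variable:

```python
from math import comb


def candidate_splits(n, fresh_stars):
    """Number of ways to cut a list of length n into `fresh_stars`
    consecutive, possibly empty, parts (one part per fresh * variable).

    Hypothetical helper: it counts candidates, it does not run a matcher.
    """
    # Stars and bars: choose where the fresh_stars - 1 cut points fall.
    return comb(n + fresh_stars - 1, fresh_stars - 1)
```

With one fresh * there is a single candidate (the whole list), with two there are n + 1, and with three there are (n + 1)(n + 2) / 2. So each extra fresh * multiplies the work by roughly another factor of n, which is the "one more level of loop nesting" described above.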

In your code of findClone2 you are first generating all combinations of sublists and then filtering them using the if construct. So that's a correct algorithm, but probably slow. This is your code:

void findClone2(list[str] In, list[str] whatWeSearchFor)
{
    for ([*str head, *str mid, *str end] := In)
    if (mid == whatWeSearchFor)
        println("gotcha");
}

You can see that it has a nested loop over In, because it has two effective * operators in the pattern. The code therefore runs in O(n^2), where n is the length of In, i.e. it has quadratic runtime behaviour in the size of the In list. In is a big list, so this matters.
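As a rough model of what findClone2 asks for, the same generate-then-filter shape can be written with explicit loops. This is Python rather than Rascal, `find_clone2_model` is a made-up name, and the assumption is only that every split into head, mid and end is tried; the real matcher may order or implement this differently:

```python
def find_clone2_model(xs, needle):
    """Mimic findClone2: enumerate every (head, mid, end) split, then filter.

    Returns the start positions of the hits and the number of candidate
    splits that were examined.
    """
    n = len(xs)
    hits = []
    examined = 0
    for i in range(n + 1):           # every possible length of head
        for j in range(i, n + 1):    # every possible end of mid; end is the rest
            examined += 1
            if xs[i:j] == needle:    # the `if (mid == whatWeSearchFor)` filter
                hits.append(i)
    return hits, examined
```

For a list of four elements this already examines 15 candidates to find at most a handful of hits, and each candidate mid is a freshly built sublist, which may be part of why memory fills up on a very long list.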

In the following new code, we filter first while generating answers, using fewer lines of code:

public void findCloneLinear(list[str] In, list[str] whatWeSearchFor)
{
    for ([*str head, *whatWeSearchFor, *str end] := In)
        println("gotcha");
}

The second * operator does not generate a new loop because it is not fresh. It just "pastes" the given list values into the pattern. So now there is actually only one effective *, the first one on head, and it makes the algorithm loop over the list. The second * tests whether the elements of whatWeSearchFor are all right there in the list after head (this is linear in the size of whatWeSearchFor), and then the last *, on end, just completes the list, allowing for more stuff to follow.
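Spelled out with an explicit loop, the filtered pattern corresponds to something like the following Python sketch (again not Rascal; `find_clones` is a hypothetical name, and the mapping from pattern parts to loop parts is my reading of the description above, not the actual implementation of the matcher):

```python
def find_clones(xs, needle):
    """Return every start index at which `needle` occurs in `xs`."""
    k = len(needle)
    hits = []
    for i in range(len(xs) - k + 1):   # the only loop: the length of head
        if xs[i:i + k] == needle:      # *whatWeSearchFor: compare k elements in place
            hits.append(i)             # whatever follows plays the role of end
    return hits
```

The total work is about n times the length of whatWeSearchFor, so for a short search list it behaves linearly in the size of In, much like the hand-written loops in the question.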

It's also nice to know where the clone is sometimes:

public void findCloneLinear(list[str] In, list[str] whatWeSearchFor)
{
    for ([*head, *whatWeSearchFor, *_] := In)
        println("gotcha at <size(head)>");
}

Rascal does not have an optimising compiler (yet) that might internally transform your algorithms into equivalent optimised ones. So as a Rascal programmer you are still expected to know the effect of loops on your algorithm's complexity, and to know that * is a very short notation for a loop.

Wow, thanks Jurgen. Your solution seems to be just as fast as the one with explicit loops. Out of curiosity, why is your solution faster than Mark's with a `fail` added after the println? Should that not also create only a loop on the first `*` operator? — Bouke, Nov 01 '15 at 10:51
Although the clone location detection does not seem to work that well. It seems like `head` only contains the first string in the list and thus `size(head)` will return the length of that string. — Bouke, Nov 01 '15 at 11:17
I played a little more with your solution and found that I could get `head` to be treated as a list of strings by adding `str` between `*` and `head`. Now I do get the correct outcome, but it has gotten a lot slower (slower than my 'explicit loops' implementation: `map[str, num]: ("explicit loops":1875,"implicit loops":121533)`). The thing is that your original implementation did give the correct number of gotchas, so it did seem to detect them properly. Did I somehow trigger the creation of intermediate string objects by adding the `str` keyword, thus creating a lot of GC overhead? — Bouke, Nov 01 '15 at 11:30
I don't know without having a look at your code. Let's do that Tuesday "live". `*` is indeed the notation to match a list. My solution is faster than Mark's one because of arbitrary reasons, I don't think that difference will survive an arbitrary new version of Rascal. The complexity of the algorithm is the same. — Jurgen Vinju, Nov 02 '15 at 08:53
