1

Problem

I have two strings $base and $succ and want to check whether $succ can be created from $base by inserting arbitrary strings at arbitrary positions.

Some examples, written as isSucc($base, $succ):

  • isSucc("abc", "abcX") is true, since we can insert X at the end of abc.
  • isSucc("abc", "XabYcZ") is true, since we can insert X, Y, and Z around the characters of abc.
  • isSucc("abc", "abX") is false, since c was deleted.
  • isSucc("abc", "cab") is false, since c (the one at the end of abc) was deleted.
  • isSucc("abc", "cabc") is true, since we can insert c at the beginning of abc.
  • isSucc("", "anything") is true, since we just have to insert "anything".

We can assume that $base and $succ do not contain any (vertical) whitespace. Usually we have length($base) < length($succ) < 1000. Good performance is not critical but would be nice to have.

I already know one way to implement isSucc*.

Question

  • Is there a perl-module that defines something similar to isSucc?
  • Does someone have an easier/faster/alternative implementation* of isSucc?

* My Approach

Compute the edit (Levenshtein) distance from $base to $succ with a custom cost model: cost(insertion) = 0, cost(deletion) = cost(substitution) = 1. Then check whether the resulting distance is 0.
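A minimal sketch of this approach, assuming a standard row-by-row dynamic-programming table. The names `insertion_free_distance` and `is_succ_by_distance` are hypothetical and chosen only for illustration:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: edit distance from $base to $succ where
# insertions are free and deletions/substitutions cost 1 each.
sub insertion_free_distance {
    my ($base, $succ) = @_;
    my @b = split //, $base;
    my @s = split //, $succ;

    # Row 0: an empty prefix of $base becomes any prefix of $succ
    # by free insertions only, so every entry is 0.
    my @prev = (0) x (@s + 1);

    for my $i (1 .. @b) {
        # Column 0: all $i characters of $base must be deleted.
        my @cur = ($i);
        for my $j (1 .. @s) {
            my $subst = $prev[$j - 1] + ($b[$i - 1] eq $s[$j - 1] ? 0 : 1);
            my $del   = $prev[$j] + 1;   # delete $b[$i - 1]
            my $ins   = $cur[$j - 1];    # insert $s[$j - 1] at no cost
            push @cur, min($subst, $del, $ins);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# $succ can be built from $base by insertions alone
# exactly when the distance is 0.
sub is_succ_by_distance {
    my ($base, $succ) = @_;
    return insertion_free_distance($base, $succ) == 0 ? 1 : 0;
}
```

This takes time proportional to length($base) * length($succ), which should be acceptable for strings shorter than 1000 characters but is more work than a single left-to-right scan.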

Comparing the Answers

I wanted to compare the three solutions loop, greedy matching, and non-greedy matching, but the matching methods often took more than 100 times as long as the loop solution so I aborted the tests. Nevertheless – or perhaps exactly for that reason – we have a clear winner: The loop solution.
Big thanks to Christoffer Hammarström.

Socowi

4 Answers

3
sub is_subsequence {
    my ($needles, $haystack) = @_;
    my $found = 0;
    for my $needle (split '', $needles) {               # for each character $needle in $needles
        $found = 1 + index $haystack, $needle, $found;  # find it after the previous one in $haystack
        return 0 unless $found;                         # return false if we can't
    }
    return 1;                                           # return true if we found all $needles in $haystack
}

use Test::More tests => 6;              # 1..6
is 1, is_subsequence("abc", "abcX");    # ok 1
is 1, is_subsequence("abc", "XabYcZ");  # ok 2
is 0, is_subsequence("abc", "abX");     # ok 3
is 0, is_subsequence("abc", "cab");     # ok 4
is 1, is_subsequence("abc", "cabc");    # ok 5
is 1, is_subsequence("", "anything");   # ok 6
Christoffer Hammarström
  • 3
    I'm not sure I agree. I mean, yes, you *can* do this by finding the longest common subsequence and then comparing; but the longest common subsequence problem involves dynamic programming (hence extra storage and so on) and quadratic time, whereas the OP's problem can be solved with a greedy algorithm in linear time. – ruakh Nov 12 '16 at 15:26
  • @ruakh Oops, seems like I missed the obvious and fast solution. Many thanks to the two of you! – Socowi Nov 12 '16 at 15:44
  • @ChristofferHammarström such a simple solution, nice one. I found it a little hard to work out, though, as `is_subsequence` threw me off, since `abc` is not a subsequence of `XabYcZ` for example. Might I suggest renaming the sub to something like `is_expandable_into`? – Peter R Mar 07 '17 at 16:30
  • 2
    @PeterR: `abc` is a subsequence of `XabYcZ`, but it is not a substring. See https://en.wikipedia.org/wiki/Subsequence which says "The subsequence should not be confused with substring" – Christoffer Hammarström Mar 07 '17 at 16:45
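The distinction drawn in this comment can be illustrated with a small sketch. The helper names `is_substring` and `is_subseq` are hypothetical; `is_subseq` is the same greedy scan as in the answer, written with an explicit position variable:

```perl
use strict;
use warnings;

# Substring: the characters must appear contiguously in $haystack.
sub is_substring {
    my ($needle, $haystack) = @_;
    return index($haystack, $needle) >= 0 ? 1 : 0;
}

# Subsequence: the characters must appear in order, gaps allowed.
sub is_subseq {
    my ($needle, $haystack) = @_;
    my $pos = 0;
    for my $char (split //, $needle) {
        my $hit = index($haystack, $char, $pos);  # search from $pos onwards
        return 0 if $hit < 0;                     # character not found
        $pos = $hit + 1;                          # continue after the hit
    }
    return 1;
}

printf "substring:   %d\n", is_substring("abc", "XabYcZ");  # prints 0
printf "subsequence: %d\n", is_subseq("abc", "XabYcZ");     # prints 1
```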
1
sub isSucc {
 my($base, $succ)=@_;
 $base=~s/./quotemeta($&).".*?"/ge;
 $succ =~ $base;
}

Create the regular expression a.*?b.*?c.*? from the string abc and match $succ against it.

Mike
-1
{
    my $last_base;
    my $last_re;

    sub is_succ {
        my ($base, $succ) = @_;
        my $re;

        if (defined $last_base && $base eq $last_base) {  # reuse the cached pattern
            $re = $last_re;
        }
        else {
            $last_base = $base;
            $last_re = $re = join(".*?", map { quotemeta($_) } split("", $base));
        }

        return $succ =~ /$re/;
    }
}
Denis Ibaev
  • 2
    You should use a non-greedy quantifier `.*?` to avoid catastrophic backtracking. – nwellnhof Nov 12 '16 at 16:33
  • @nwellnhof: Changing to non-greedy quantifiers doesn't in general result in an improvement in backtracking. For some strings, greedy quantifiers may produce a faster result. – Borodin Nov 12 '16 at 16:59
  • @Borodin In this specific case, greedy quantifiers are dangerous. Try for yourself: `perl -MBenchmark=:all -e"cmpthese(100000, { greedy => sub { 'aaaaaaaaaa' =~ /a.*a.*a.*a.*a.*a.*a.*a.*a.*a/ }, lazy => sub { 'aaaaaaaaaa' =~ /a.*?a.*?a.*?a.*?a.*?a.*?a.*?a.*?a.*?a/ } });"` – nwellnhof Nov 12 '16 at 17:28
  • @nwellnhof: That is an artificial test case designed to exacerbate the problem. Try it with the OP's own data. – Borodin Nov 12 '16 at 17:39
  • @nwellnhof: `perl -MBenchmark=:all -e"cmpthese(1E7, { greedy => sub { 'axxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaa' =~ /a.*a.*a.*a.*a.*a.*a.*a.*a.*a/ }, lazy => sub { 'axxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaa' =~ /a.*?a.*?a.*?a.*?a.*?a/ } });"` – Borodin Nov 12 '16 at 18:58
  • @Borodin I don't see what you're getting at. Your example is basically a tie between lazy and greedy quantifiers on my machine. For the type of the regex in question, lazy quantifiers should be at least as fast as greedy ones, but they avoid pathological cases. Even with real world data like `banana`, they can be considerably faster. – nwellnhof Nov 12 '16 at 19:15
-1

The power of regular expressions makes this task quite simple:

use strict;
use warnings;
use feature 'say';

my $needle = 'abc';

while(<DATA>) {
    chomp;
    say "'$needle' in '$_'" if search_needle($needle,$_);
}

say "'' in 'Anything'" if search_needle('','Anything');

sub search_needle {
    my $needle = shift;
    my $haystack = shift;

    my $re = join('.*?', split('',$needle));

    return $haystack =~ /$re/;

}

__DATA__
abcX
XabYcZ
abX
cab
cabc

Output

'abc' in 'abcX'
'abc' in 'XabYcZ'
'abc' in 'cabc'
'' in 'Anything'
Polar Bear
  • Note that you can't blindly add characters from `needle` to a regex. You need to escape the meta characters first, as was already done over three years ago by another answer here. – brian d foy Jun 12 '20 at 12:18
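A short sketch of why the escaping matters, assuming the needle may contain regex metacharacters. The helper `build_pattern` is hypothetical and only illustrates the difference:

```perl
use strict;
use warnings;

# Hypothetical helper: builds the lazy "a.*?b.*?c" style pattern,
# optionally escaping each character with quotemeta first.
sub build_pattern {
    my ($needle, $escape) = @_;
    my @chars = split //, $needle;
    @chars = map { quotemeta($_) } @chars if $escape;
    return join '.*?', @chars;
}

my $raw     = build_pattern('a.c', 0);  # a.*?..*?c   (the dot matches anything)
my $escaped = build_pattern('a.c', 1);  # a.*?\..*?c  (the dot is literal)

print "unescaped pattern matches 'abc'\n" if 'abc' =~ /$raw/;      # false positive
print "escaped pattern matches 'abc'\n"   if 'abc' =~ /$escaped/;  # not printed
print "escaped pattern matches 'a.c'\n"   if 'a.c' =~ /$escaped/;
```

Without escaping, a needle such as `a.c` is reported as found in `abc` even though `abc` contains no dot. A needle containing an unbalanced `(` would be worse still, since the pattern would fail to compile.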