bash subsequence matching speed-up

Question

Is there an easy way to check in bash whether a string is a subsequence of another string? More precisely, I need a subsequence with one extra rule, explained below.

Some subsequences of "apple" are "aple", "al", "pp" and "ale". I only want the subsequences that start and end with the same letters as the original string, so of these only "aple" and "ale" qualify.

I have made the following program:

#!/bin/bash
while read line
do
    search=$(echo "$line" | tr -s 'A-Za-z' | sed 's/./\.\*&/g;s/^\.\*//' )
    expr match "$1" "$search" >/dev/null && echo "$line"
done

It is executed as follows:

./program.sh greogdgedlqfe < words.txt

This program works, but is very slow.

It takes every line of the file, turns it into a regular expression, checks whether that expression matches the argument, and if so prints the original line. For example:

One of the lines contains the word google.

$search becomes g.*o.*g.*l.*e (repeated letters are squeezed, extra rule).

Then that expression is checked against the given parameter and, if it matches, the line is printed: google
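The transformation described above can be checked in isolation. This is only a minimal sketch: the function name word_to_regex is made up for illustration, and it assumes the words contain only letters.

```bash
#!/bin/bash
# word_to_regex is a hypothetical helper, not part of the original script.
# It squeezes repeated letters and puts ".*" between the remaining ones.
word_to_regex() {
    printf '%s\n' "$1" | tr -s 'A-Za-z' | sed 's/./.*&/g;s/^\.\*//'
}

word_to_regex google   # prints g.*o.*g.*l.*e
```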

This works fine, but when words.txt gets large the program becomes too slow. How can I speed it up, possibly with a faster way of matching subsequences?

Edit after the possible solution by KamilCuk

That solution returns quick, quiff, quin and qwerty for the string "qwertyuihgfcvbnhjk", but only quick should be returned, so it is almost correct, but not quite.

Can you post some excerpt from words.txt and example outputs. I cannot test your script, some words that match and don't match for certain inputs would be helpful. Is `apppppppple` a subsequence of `apple`? Because your script will match it for `apple`. — KamilCuk, Nov 21 '19 at 16:11
If I understand correctly, there are only 4 valid subsequents of `apple`: `ae` `ale` `ape` `aple`. Right? — KamilCuk, Nov 21 '19 at 16:25
yes but apppppppppppe would also match in my program, that is intended. — fangio, Nov 21 '19 at 16:27
Och? So is `apple` also a subsequent of apple? The "subsequent" doesn't look like "sub-sequence" then, rather like an expansion. So a subsequence is just anything that matches the regex consisted of letters of a word with `.*` in between letters, end of story? — KamilCuk, Nov 21 '19 at 16:28
No, appppe is a subsequent of apple because my programm squeezes repeated characters first. so appppe becomes ape and that is subsequent of apple. — fangio, Nov 21 '19 at 16:30
Are you still looking for solution, do any of them work for you ? — dash-o, Nov 26 '19 at 17:03

KamilCuk · Accepted Answer · 2019-11-21T23:55:06.083

2

Try it like so:

grep -x "$(<<<"$1" tr -s 'A-Za-z' | sed 's/./&*/g;s/\*$//;s/\*//1')" words.txt

Tested against:

set -- apple  
cat >words.txt <<EOF
aple
al
pp
ale
fdafda
apppppppple
apple
google
EOF

outputs:

aple
ale
apppppppple
apple

And for set -- greogdgedlqfe it outputs just google.

If I understand you correctly, a "subsequent" of apple is everything that matches ap*l*e.
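To see which pattern grep actually receives, the tr and sed pipeline can be run on its own. A small sketch, where build_pattern is a hypothetical name wrapping the pipeline from the command above:

```bash
#!/bin/bash
# build_pattern is a made-up helper name; the pipeline is the one from the answer.
# It squeezes repeats, appends "*" to every letter, then drops the last and first "*".
build_pattern() {
    tr -s 'A-Za-z' <<<"$1" | sed 's/./&*/g;s/\*$//;s/\*//1'
}

build_pattern apple           # prints ap*l*e
build_pattern greogdgedlqfe   # prints gr*e*o*g*d*g*e*d*l*q*f*e
```

The first and last letters lose their asterisk, so they are mandatory, which is what enforces the "same first and last letter" rule.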

Tested on repl

edited Nov 21 '19 at 23:55

answered Nov 21 '19 at 16:36

KamilCuk


very nice exactly what I want to implement Swype or SwiftKey – fangio Nov 21 '19 at 16:45
actually not, it should match a * p * l * e, no spaces however but needed for output – fangio Nov 21 '19 at 16:54
Small note: the grep pattern for 'greogdgedlqfe' is 'gr*e*o*g*d*g*e*d*l*q*f*e*'. I believe the 'extra' rule requested is for the pattern to be 'gr*e*o*g*d*g*e*d*l*q*f*e' (start with g, end with e). – dash-o Nov 21 '19 at 17:27
@KamilCuk almost correct, I edited my question with more examples and answer why yours is wrong – fangio Nov 21 '19 at 17:29
1

I think a minor change: 'grep -x "$(<<<"$1" tr -s 'A-Za-z' | sed 's/./&*/g;s/\*&//;s/\*//1;s/\*$//')" words.txt ' will address the `qwertyuihgfcvbnhjk` bug. – dash-o Nov 21 '19 at 18:42
There was a typo. The `&` was meant to be `$`, to remove the last. Dunno why didn't I see it. Now it returns quick for those 4 words in your edit.... – KamilCuk Nov 21 '19 at 23:55

chepner · Answer 2 · 2019-11-21T16:20:09.533

0

bash does not need to use expr (an external program) for regular-expression matching; it provides built-in access to your system's library.

#!/bin/bash
while read line
do
    search=$(echo "$line" | tr -s 'A-Za-z' | sed 's/./\.\*&/g;s/^\.\*//' )
    [[ $1 =~ $search ]] && echo "$line"
done
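Two things may be worth keeping in mind with the loop above: it still starts tr and sed once per line, and =~ is not anchored, whereas expr match anchors the pattern at the start of the string. The following is only a sketch of one way to handle both with bash builtins. The function name match_words is made up, the explicit ^ and $ anchors are an assumption based on the "same first and last letter" rule in the question, and it assumes the words contain only letters (no regex metacharacters).

```bash
#!/bin/bash
# match_words is a hypothetical helper: reads words on stdin and prints those
# whose squeezed letters form an anchored subsequence of the target in $1.
match_words() {
    local target=$1 line search prev c i
    while IFS= read -r line; do
        [[ -n $line ]] || continue
        search=${line:0:1}
        prev=$search
        for ((i = 1; i < ${#line}; i++)); do
            c=${line:i:1}
            [[ $c == "$prev" ]] && continue   # squeeze repeats, like tr -s
            search+=".*$c"
            prev=$c
        done
        # Assumption: anchor both ends so first and last letters must agree.
        [[ $target =~ ^${search}$ ]] && printf '%s\n' "$line"
    done
}

printf '%s\n' google eg gle good | match_words greogdgedlqfe
# prints:
# google
# gle
```

Called as `match_words "$1" < words.txt`, it would take the place of the while loop in the script.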

edited Nov 21 '19 at 16:20

answered Nov 21 '19 at 15:56

chepner


how to use grep in my program? – fangio Nov 21 '19 at 16:11
Sorry, forgot to feed the argument to `grep`. – chepner Nov 21 '19 at 16:15
The problem with `grep` it will just output the content of `$1`. We want to output the lines from `words.txt`, right? We would need to make grep output the regex it matched against $1 – KamilCuk Nov 21 '19 at 16:17
Good point. An even faster solution, though, would be to do all this in a single `awk` process. I'll see if I can get that right after several failed attempts. – chepner Nov 21 '19 at 16:24

choroba · Answer 3 · 2019-11-21T16:11:14.587

0

You can use a pattern instead of a regex. Just insert an asterisk after each letter of each word (except the last letter) and use a normal pattern match.

#!/bin/bash
while read line
do
    pattern=""
    for ((i=${#line}-1 ; i>=0 ; --i)) ; do
        pattern="${line:i:1}*"$pattern
    done
    pattern=${pattern%'*'}

    if [[ "$1" == $pattern ]] ; then
        echo "$line"
    fi
done
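One difference from the question's script: this loop does not squeeze repeated letters the way tr -s does, so google becomes g*o*o*g*l*e rather than g*o*g*l*e. A small sketch of what that means for the example input (glob_match is a made-up helper name):

```bash
#!/bin/bash
# glob_match is a hypothetical helper: succeeds if string $1 matches glob pattern $2.
glob_match() {
    [[ $1 == $2 ]]   # the unquoted right-hand side is treated as a pattern
}

# Pattern as built by the loop above (no squeezing): needs two o's in the input.
glob_match greogdgedlqfe 'g*o*o*g*l*e' && echo "unsqueezed: match" || echo "unsqueezed: no match"
# Pattern after squeezing the repeated o first.
glob_match greogdgedlqfe 'g*o*g*l*e' && echo "squeezed: match" || echo "squeezed: no match"
# prints:
# unsqueezed: no match
# squeezed: match
```

If the squeezing behaviour is wanted, the loop could skip a character that equals the previous one before adding it to the pattern.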

edited Nov 21 '19 at 16:11

answered Nov 21 '19 at 15:59

choroba


dash-o · Answer 4 · 2019-11-22T09:07:36.997

Hard to beat perl with regexp.

Performance

The key to performance is to avoid forking extra processes. Most bash solutions presented here (with the exception of KamilCuk's grep-based solution, which is not always correct) require multiple calls to sed, tr, etc. for every line. Perl will outperform those solutions. Even if a pure bash solution can be implemented (using bash regular expressions or patterns), Perl is likely to outperform it when the word list is large.

Consider program.pl appl < words.txt

#! /usr/bin/perl
use strict ;

my $word = shift @ARGV ;

while ( <> ) {
    chomp ;
    my $p = $_ ;
    tr/A-Za-z//s ;
    s/(.)/.*$1/g ;
    s/^\.\*// ;
    print $p, "\n" if $word =~ "^$_\$" ;
} ;

Update 1: Perl implementation of KamilCuk solution + fix.

After a minor fix, I believe it is possible to use the idea from the grep-based solution to create a Perl program that is even faster. It creates a single regexp and tests each word in the word list file against it. I think this is as optimal as possible with Perl.

#! /usr/bin/perl
use strict ;

$_ = shift @ARGV ;
tr/A-Za-z//s ;
s/(.)/$1*/g ;
s/\*// ;
s/\*$// ;
my $re = "^$_\$" ;
print "RE=$re\n" ;

while ( <> ) {
        chomp ;
        print $_, "\n" if /$re/ ;
} ;
