Algorithm to match 2 lists with wildcards

Question

I'm looking for an efficient way to match 2 lists, one which contains complete information, and one which contains wildcards. I've been able to do this with wildcards of fixed lengths, but am now trying to do it with wildcards of variable lengths.

Thus:

match( ['A', 'B', '*', 'D'], ['A', 'B', 'C', 'C', 'C', 'D'] )

would return True as long as all the elements are in the same order in both lists.

I'm working with lists of objects, but used strings above for simplicity.

Are you working with characters/strings only? This sounds like a job for regular expressions. — aganders3, Jan 13 '12 at 07:36
No, unfortunately, I'm working with lists of objects. I suppose I COULD convert the objects to string representations first (and then use RE's) but I would much rather avoid such a workaround. I edited my post to clarify. — Joel Cornett, Jan 13 '12 at 18:12

score 5 · Accepted Answer · edited May 23 '17 at 10:28

[edited to justify no RE after OP comment on comparing objects]

It appears you are not using strings, but rather comparing objects. I am therefore giving an explicit algorithm. Regular expressions provide a good solution tailored for strings, don't get me wrong, but from what you say in your comment on the question, it seems an explicit, simple algorithm may make things easier for you.

It turns out that this can be solved with a much simpler algorithm than this previous answer:

def matcher (l1, l2):
    if (l1 == []):
        return (l2 == [] or l2 == ['*'])
    if (l2 == [] or l2[0] == '*'):
        return matcher(l2, l1)
    if (l1[0] == '*'):
        return (matcher(l1, l2[1:]) or matcher(l1[1:], l2))
    if (l1[0] == l2[0]):
        return matcher(l1[1:], l2[1:])
    else:
        return False

The key idea is that when you encounter a wildcard, you can explore two options:

either advance in the list that contains the wildcard (and consider the wildcard matched whatever there was until now)
or advance in the list that doesn't contain the wildcard (and consider that whatever is at the head of the list has to be matched by the wildcard).
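
The two options above can also be written on indices rather than list slices, with the results cached. This is only a sketch of the same idea, not the answer's original code: the names `matcher_memo`, `go` and `WILDCARD` are made up for illustration, and it assumes the wildcard is the string '*' and that elements support `==`. Because each pair of positions is evaluated at most once, the work is bounded by roughly len(l1) * len(l2) steps.

```python
from functools import lru_cache

WILDCARD = '*'  # assumed wildcard marker, as in the example above

def matcher_memo(l1, l2):
    """Two-option recursion on indices, memoized per (i, j) pair."""
    n1, n2 = len(l1), len(l2)

    @lru_cache(maxsize=None)
    def go(i, j):
        # Both lists fully consumed: the match succeeded.
        if i == n1 and j == n2:
            return True
        # Wildcard in l1: either drop it, or let it absorb l2[j].
        if i < n1 and l1[i] == WILDCARD:
            return go(i + 1, j) or (j < n2 and go(i, j + 1))
        # Wildcard in l2: the symmetric case.
        if j < n2 and l2[j] == WILDCARD:
            return go(i, j + 1) or (i < n1 and go(i + 1, j))
        # Plain elements must be equal for both lists to advance.
        if i < n1 and j < n2 and l1[i] == l2[j]:
            return go(i + 1, j + 1)
        return False

    return go(0, 0)
```

The recursion depth grows with len(l1) + len(l2), so very long lists may need an iterative formulation instead.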

answered Jan 13 '12 at 10:18

Francois G

This may run in exponential time if there are a lot of stars. – templatetypedef Jan 13 '12 at 19:24
Thank you, this is exactly what I needed. On a side note, is there a specific reason you use else: return False, instead of just returning false in the function block? – Joel Cornett Jan 13 '12 at 20:34
@templatetypedef Right. If needed, we can collapse successive pattern stars into a single one & then transform the recursion into an explicit imperative stack — to ensure 1 side of the alternative is explored before the other, *& then stopped w/o exploring the rest if it returns `True`* (probable if lots of stars). There are still exponential cases (e.g. 1 star every other char in both patterns), but they should start being rare. Anyway, isn't that sequencing of recursion the semantics of Python's eager evaluation already ? i.e. don't even look at the right call until the left has returned ? – Francois G Jan 14 '12 at 00:31
@huitseeker excellent work, and you overshot the mark. The OP specified that one list contained complete information. So, ten out of ten for style, but minus several million for good reading? Rounded up to +1E0 – Orwellophile Aug 15 '16 at 11:20
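
If only one list contains wildcards, as the last comment points out, the recursion can be replaced by a loop that remembers the most recent wildcard and backtracks to it on a mismatch. The following is a hedged sketch of that well-known technique, not code from this thread: `match_one_sided`, `star_p` and `star_i` are illustrative names, and it assumes the wildcard is the string '*' and that `items` contains no wildcards. Its worst case is on the order of len(pattern) * len(items) comparisons, and runs of consecutive wildcards need no special handling.

```python
WILDCARD = '*'  # assumed wildcard marker

def match_one_sided(pattern, items):
    """Match items (no wildcards) against pattern (may contain WILDCARD)."""
    p = i = 0
    star_p = -1   # index in pattern of the most recent wildcard, -1 if none
    star_i = 0    # index in items where that wildcard's span currently ends
    while i < len(items):
        if p < len(pattern) and pattern[p] == WILDCARD:
            # Remember the wildcard and first try letting it match nothing.
            star_p, star_i = p, i
            p += 1
        elif p < len(pattern) and pattern[p] == items[i]:
            p += 1
            i += 1
        elif star_p != -1:
            # Mismatch: let the last wildcard absorb one more item and retry.
            star_i += 1
            p, i = star_p + 1, star_i
        else:
            return False
    # Items are exhausted; only trailing wildcards may remain in the pattern.
    while p < len(pattern) and pattern[p] == WILDCARD:
        p += 1
    return p == len(pattern)
```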

NPE · Answer 2 · score 1 · 2012-01-13T08:39:11.897

How about the following:

import re

def match(pat, lst):
  regex = ''.join(term if term != '*' else '.*' for term in pat) + '$'
  s = ''.join(lst)
  return re.match(regex, s) is not None

print(match( ['A', 'B', '*', 'D'], ['A', 'B', 'C', 'C', 'C', 'D'] ))

It uses regular expressions. Wildcards (*) are changed to .* and all other search terms are kept as-is.

One caveat is that if your search terms could contain things that have special meaning in the regex language, those would need to be properly escaped. It's pretty easy to handle this in the match function, I just wasn't sure if this was something you required.
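
For the escaping caveat, a minimal sketch could look like the following. It assumes Python 3.4+ (for `re.fullmatch`) and that all elements are strings; the name `match_escaped` is made up here and is not part of the answer.

```python
import re

def match_escaped(pat, lst):
    # Escape every literal term; only the '*' wildcard becomes '.*'.
    regex = ''.join('.*' if term == '*' else re.escape(term) for term in pat)
    # fullmatch anchors both ends; DOTALL lets '.*' span newline characters.
    return re.fullmatch(regex, ''.join(lst), flags=re.DOTALL) is not None
```

With this, a term such as 'a.c' only matches the literal 'a.c' and no longer 'abc'. Note that joining multi-character elements is still ambiguous (['AB'] and ['A', 'B'] produce the same string), so this variant is mainly suited to single-character elements.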

edited Jan 13 '12 at 08:39

answered Jan 13 '12 at 08:34

NPE

How efficient is this? Could the behind-the-scenes construction of the matcher cause this to be exponentially slow? – templatetypedef Jan 13 '12 at 08:42

mathematical.coffee · Answer 3 · 2012-01-13T08:44:57.837

I'd recommend converting ['A', 'B', '*', 'D'] to '^AB.*D$', ['A', 'B', 'C', 'C', 'C', 'D'] to 'ABCCCD', and then using the re module (regular expressions) to do the match.

This will be valid if the elements of your lists are only one character each, and if they're strings.

Something like:

import re

def myMatch( patternList, stringList ):
    # convert pattern to flat string with wildcards
    # convert AB*D to valid regex ^AB.*D$
    pattern = ''.join(patternList)
    regexPattern = '^' + pattern.replace('*','.*') + '$'
    # perform matching
    against = ''.join(stringList) # convert ['A','B','C','C','C','D'] to ABCCCD
    # return whether there is a match
    return (re.match(regexPattern,against) is not None)

If the lists contain numbers, or words, choose a character that you wouldn't expect to be in either, for example #. Then ['Aa','Bs','Ce','Cc','CC','Dd'] can be converted to Aa#Bs#Ce#Cc#CC#Dd, the wildcard pattern ['Aa','Bs','*','Dd'] could be converted to ^Aa#Bs#.*#Dd$, and the match performed.

Practically speaking this just means all the ''.join(...) becomes '#'.join(...) in myMatch.
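
One detail worth checking in the joined form: a pattern like ^Aa#Bs#.*#Dd$ does not match Aa#Bs#Dd, so a wildcard standing for zero elements is rejected. A possible variant, sketched here under the assumption that the separator is a single character that never occurs inside an element, terminates every token with the separator and lets the wildcard stand for zero or more whole tokens. The names `match_tokens` and `SEP` are illustrative only.

```python
import re

SEP = '#'  # assumed single-character separator, absent from every element

def match_tokens(pattern_list, token_list, sep=SEP):
    sep_re = re.escape(sep)
    # A wildcard stands for zero or more whole tokens, each ending in sep.
    any_tokens = '(?:[^{0}]*{0})*'.format(sep_re)
    parts = [any_tokens if term == '*' else re.escape(term) + sep_re
             for term in pattern_list]
    # Terminate every token with sep so tokens cannot run together.
    against = ''.join(tok + sep for tok in token_list)
    return re.fullmatch(''.join(parts), against) is not None
```

Because tokens are terminated rather than joined, ['Aa', '*', 'Dd'] matches ['Aa', 'Dd'] but not ['Aa', 'xDd'].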

How efficient is this? Could the behind-the-scenes construction of the matcher cause this to be exponentially slow? — templatetypedef, Jan 13 '12 at 08:42
I don't think you need to worry about overheads. the `''.join` is very fast, and the regex is quite simple (no lookarounds). — mathematical.coffee, Jan 13 '12 at 08:44

jcollado · Answer 4 · score 0 · 2012-01-13T08:42:44.377

I agree with the comment that this could be done with regular expressions. For example:

import re

lst = ['A', 'B', 'C', 'C', 'C', 'D']
pattern = ['A', 'B', 'C+', 'D']

print(re.match(''.join(pattern), ''.join(lst)))  # Will successfully match

Edit: As pointed out by a comment, it might be known in advance just that some character has to be matched, but not which one. In that case, regular expressions are useful still:

import re

lst = ['A', 'B', 'C', 'C', 'C', 'D']
pattern = r'AB(\w)\1*D'

print(re.match(pattern, ''.join(lst)).groups())

edited Jan 13 '12 at 08:42

answered Jan 13 '12 at 08:28

jcollado

But this presupposes that you know what symbol the + is supposed to match, and it also presupposes that it matches 1 or more copies. – templatetypedef Jan 13 '12 at 08:33
@templatetypedef Thanks for your comment. I've edited my answer to cover the case in which a character is matched without knowing which one in advance. I agree that I'm making some assumptions that might be useful only depending on the data that the OP is working with. – jcollado Jan 13 '12 at 08:46

score 0 · Answer 5 · answered Jan 13 '12 at 09:02

I agree, regular expressions are usually the way to go with this sort of thing. This algorithm works, but it just looks convoluted to me. It was fun to write though.

def match(listx, listy):
    listx, listy = map(iter, (listx, listy))
    while 1:
        try:
            x = next(listx)
        except StopIteration:
            # This means listx is exhausted; check whether listy is too.
            try:
                y = next(listy)
            except StopIteration:
                # This means there are no more values to be compared in either
                # listx or listy; since no mismatch was found elsewhere, the
                # lists match.
                return True
            else:
                # This means that there are values in listy that are not in
                # listx.
                return False
        else:
            try:
                y = next(listy)
            except StopIteration:
                # Similarly, there are values in listx that aren't in listy.
                return False
        if x == y:
            pass
        elif x == '*':
            try:
                # Get the value in listx after '*'.
                x = next(listx)
            except StopIteration:
                # This means that listx terminates with '*'. If there are any
                # remaining values of listy, they will, by definition, match.
                return True
            while 1:
                if x == y:
                    # I didn't shift to the next value in listy because I
                    # assume that a '*' matches the empty string as well as
                    # any other.
                    break
                else:
                    try:
                        y = next(listy)
                    except StopIteration:
                        # This means there is at least one remaining value in
                        # listx that is not in listy, because listy has no
                        # more values.
                        return False
                    else:
                        pass
        # Same algorithm as above, given there is a '*' in listy.
        elif y == '*':
            try:
                y = next(listy)
            except StopIteration:
                return True
            while 1:
                if x == y:
                    break
                else:
                    try:
                        x = next(listx)
                    except StopIteration:
                        return False
                    else:
                        pass
        else:
            # Neither value is a wildcard and they differ: no match.
            return False

score 0 · Answer 6 · answered Jan 13 '12 at 10:37

I had this C++ piece of code which seems to do what you are trying to do (inputs are strings instead of arrays of characters, but you'll have to adapt stuff anyway).

bool Utils::stringMatchWithWildcards (const std::string str, const std::string strWithWildcards)
{
    PRINT("Starting in stringMatchWithWildcards('" << str << "','" << strWithWildcards << "')");
    const std::string wildcard="*";

    const bool startWithWildcard=(strWithWildcards.find(wildcard)==0);
    std::string::size_type pos=strWithWildcards.rfind(wildcard);
    const bool endWithWildcard = (pos!=std::string::npos) && (pos+wildcard.size()==strWithWildcards.size());

    // Basically, the point is to split the string with wildcards in strings with no wildcard.
    // Then search in the first string for the different chunks of the second in the correct order
    std::vector<std::string> vectStr;
    boost::split(vectStr, strWithWildcards, boost::is_any_of(wildcard));
    // I expected all the chunks in vectStr to be non-empty. It doesn't seem the be the case so let's remove them.
    vectStr.erase(std::remove_if(vectStr.begin(), vectStr.end(), std::mem_fun_ref(&std::string::empty)), vectStr.end());

    // Check if at least one element (to have first and last element)
    if (vectStr.empty())
    {
        const bool matchEmptyCase = (startWithWildcard || endWithWildcard || str.empty());
        PRINT("Match " << (matchEmptyCase?"":"un") << "successful (empty case) : '" << str << "' and '" << strWithWildcards << "'");
        return matchEmptyCase;
    }

    // First Element
    std::vector<std::string>::const_iterator vectStrIt = vectStr.begin();
    std::string aStr=*vectStrIt;
    if (!startWithWildcard && str.find(aStr, 0)!=0) {
        PRINT("Match unsuccessful (beginning) : '" << str << "' and '" << strWithWildcards << "'");
        return false;
    }

    // "Normal" Elements
    bool found(true);
    pos=0;
    std::vector<std::string>::const_iterator vectStrEnd = vectStr.end();
    for ( ; vectStrIt!=vectStrEnd ; vectStrIt++)
    {
        aStr=*vectStrIt;
        PRINT( "Searching '" << aStr << "' in '" << str << "' from  " << pos);
        pos=str.find(aStr, pos);
        if (pos==std::string::npos)
        {
            PRINT("Match unsuccessful ('" << aStr << "' not found) : '" << str << "' and '" << strWithWildcards << "'");
            return false;
        } else
        {
            PRINT( "Found at position " << pos);
            pos+=aStr.size();
        }
    }

    // Last Element
    const bool matchEnd = (endWithWildcard || str.rfind(aStr)+aStr.size()==str.size());
    PRINT("Match " << (matchEnd?"":"un") << "successful (usual case) : '" << str << "' and '" << strWithWildcards);
    return matchEnd;
}

   /* Tested on these values :
   assert( stringMatchWithWildcards("ABC","ABC"));
   assert( stringMatchWithWildcards("ABC","*"));
   assert( stringMatchWithWildcards("ABC","*****"));
   assert( stringMatchWithWildcards("ABC","*BC"));
   assert( stringMatchWithWildcards("ABC","AB*"));
   assert( stringMatchWithWildcards("ABC","A*C"));
   assert( stringMatchWithWildcards("ABC","*C"));
   assert( stringMatchWithWildcards("ABC","A*"));

   assert(!stringMatchWithWildcards("ABC","BC"));
   assert(!stringMatchWithWildcards("ABC","AB"));
   assert(!stringMatchWithWildcards("ABC","AB*D"));
   assert(!stringMatchWithWildcards("ABC",""));

   assert( stringMatchWithWildcards("",""));
   assert( stringMatchWithWildcards("","*"));
   assert(!stringMatchWithWildcards("","ABC"));
   */

It's not something I'm really proud of but it seems to be working so far. I hope you can find it useful.
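
For lists rather than strings, the same chunk idea (split the pattern at the wildcards, anchor the first and last chunks, then search for the middle chunks in order) might be sketched in Python as below. This is an illustrative adaptation rather than a line-by-line port of the C++ above: `match_chunks`, `find_sublist` and `WILDCARD` are assumed names, the wildcard is assumed to be the string '*', and the argument order (complete list first, pattern second) mirrors the C++ function.

```python
WILDCARD = '*'  # assumed wildcard marker

def find_sublist(items, chunk, start):
    """Return the first index >= start where chunk occurs in items, or -1."""
    for k in range(start, len(items) - len(chunk) + 1):
        if items[k:k + len(chunk)] == chunk:
            return k
    return -1

def match_chunks(items, pattern):
    items = list(items)
    # Split the pattern into wildcard-free chunks.
    chunks, current = [], []
    for term in pattern:
        if term == WILDCARD:
            chunks.append(current)
            current = []
        else:
            current.append(term)
    chunks.append(current)

    if len(chunks) == 1:
        # No wildcard at all: the lists must be equal.
        return items == chunks[0]

    first, middle, last = chunks[0], chunks[1:-1], chunks[-1]
    # The first chunk is anchored at the start, the last one at the end,
    # and the two anchors must not overlap.
    if len(first) + len(last) > len(items):
        return False
    if items[:len(first)] != first:
        return False
    if items[len(items) - len(last):] != last:
        return False

    # Middle chunks are searched left to right between the two anchors.
    pos, end = len(first), len(items) - len(last)
    for chunk in middle:
        if not chunk:
            continue  # consecutive wildcards produce empty chunks
        k = find_sublist(items[:end], chunk, pos)
        if k == -1:
            return False
        pos = k + len(chunk)
    return True
```

Taking the leftmost occurrence of each middle chunk is enough, since an earlier position never leaves less room for the chunks that follow.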
