Efficiently finding the longest matching prefix string

Question

My current implementation is this:

import re

def find_longest_matching_option(option, options):
    options = sorted(options, key=len)
    longest_matching_option = None
    for valid_option in options:
        # Don't want to treat "oreo" as matching "o",
        # match only if it's "o reo"
        # re.escape keeps regex metacharacters in an option from being interpreted
        if re.match(r"^{}\s+".format(re.escape(valid_option)), option.strip()):
            longest_matching_option = valid_option
    return longest_matching_option

Some examples of what I'm trying to do:

"foo bar baz something", ["foo", "foo bar", "foo bar baz"]
# -> "foo bar baz"
"foo bar bazsomething", (same as above)
# -> "foo bar"
"hello world", ["hello", "something_else"]
# -> "hello"
"a b", ["a", "a b"]
# -> "a b" # Doesn't work in current impl.

Mostly, I'm looking for efficiency here. The current implementation works, but I've been told it's O(m^2 * n), which is pretty bad.

Thanks in advance!

If you're looking for efficiency, regex should be thrown out! You should look at a solution with `str.startswith`. — cs95, Jan 15 '18 at 08:50
@cᴏʟᴅsᴘᴇᴇᴅ I did have that, but it doesn't take into account the whitespace. Even if I added a space to each string in the `options` list, I still can't account for the other types of whitespace, or multiple spaces. — naiveai, Jan 15 '18 at 08:51
The examples you've given, seem to account for that. In fact something like `max(filter("foo bar baz something".startswith, ["foo", "foo bar", "foo bar baz"]), key=len)` would work nicely. — cs95, Jan 15 '18 at 08:52
@naiveai : you might first replace `\s+` with say `,` or a single space in all of your strings, and then use @coldspeed solution (or even more efficiently loop on sorted input and return the first match) — Faibbus, Jan 15 '18 at 08:56
What's the expected output of `find_longest_matching_option('a b', ['a', 'a b'])`? — Aran-Fey, Jan 15 '18 at 08:59
@Rawing `'a b'`, actually. Didn't think of that, and my current impl. doesn't work with that. — naiveai, Jan 15 '18 at 09:01
@naiveai Could you help me understand why it doesn't work? Are you trying to match entire words? — cs95, Jan 15 '18 at 09:05
@cᴏʟᴅsᴘᴇᴇᴅ No, I'm not trying to match words specifically. I just want to be as greedy as possible and grab everything to the point where there's a whitespace followed by a character that doesn't match any of the strings in my list. — naiveai, Jan 15 '18 at 09:08
@naiveai Understood. Both of our answers should have sufficiently addressed your problem. If there are any use cases that break, do let me (and Rawing respectively) know. — cs95, Jan 15 '18 at 09:24
@cᴏʟᴅsᴘᴇᴇᴅ, Rawing I really appreciate both your answers, I'm testing and trying to understand them right now. Will accept in a bit. — naiveai, Jan 15 '18 at 09:27
You're welcome. Whosever answer you decide to accept, remember you can upvote them both. Cheers. — cs95, Jan 15 '18 at 09:28

cs95 · Accepted Answer · 2018-01-15T09:43:39.913

2

Let's start with foo.

def foo(x, y):
    x, y = x.strip(), y.strip()
    return x == y or x.startswith(y + " ")

foo returns True if the two strings are equal after stripping, or if y followed by a space is a prefix of x.

Next, given a case string and a list of options, you can use filter to find all options that match as prefixes of the case string, and then apply max with key=len to pick the longest one (see tests below).

Here are a few test cases for foo. For the purpose of demonstration, I'll use partial to bind each case string as foo's first argument, which gives a one-argument predicate that filter can apply to the options.

from functools import partial

cases = ["foo bar baz something", "foo bar bazsomething", "hello world", "a b", "a b"]
options = [
      ["foo", "foo bar", "foo bar baz"], 
      ["foo", "foo bar", "foo bar baz"],
      ["hello", "something_else"],
      ["a", "a b"],
      ["a", "a b\t"]
]
p_list = [partial(foo, c) for c in cases]

for p, o in zip(p_list, options):
    print(max(filter(p, o), key=len))

foo bar baz
foo bar
hello
a b
a b
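
The filter-and-max step above can be wrapped into a single function. This is only a sketch: the names is_prefix_match and longest_matching_option are made up for illustration, and it assumes Python 3.4 or later, where max accepts a default argument. Without default, max raises ValueError when no option matches.

```python
def is_prefix_match(text, option):
    # Same check as foo above: equal after stripping,
    # or option followed by a space is a prefix of text.
    text, option = text.strip(), option.strip()
    return text == option or text.startswith(option + " ")


def longest_matching_option(text, options):
    # Hypothetical wrapper: keep only the matching options, pick the longest.
    # default=None avoids a ValueError when nothing matches.
    matches = (o for o in options if is_prefix_match(text, o))
    return max(matches, key=len, default=None)


print(longest_matching_option("foo bar bazsomething", ["foo", "foo bar", "foo bar baz"]))
# foo bar
print(longest_matching_option("oreo", ["o"]))
# None
```

Each option is compared against the case string once, so the work is linear in the number of options, with each comparison bounded by the option's length.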

edited Jan 15 '18 at 09:43

answered Jan 15 '18 at 09:18

cs95


I'm accepting this answer mainly because I'm a sucker for functional-esque programming, but Rawing's answer also works correctly as far as I can see, and you should use it if you want a little more straightforward solution. – naiveai Jan 15 '18 at 09:37
@naiveai I'm "partial" to this solution myself, given the core functionality encompasses a whopping 3 lines. Complexity wise, the answers are (probably) the same, but Rawing's may be faster (in terms of the constants in the BigO) because it's a plain loop at C speed. – cs95 Jan 15 '18 at 09:39
Personally, I feel it's exceedingly elegant and fast *enough* for my needs. If I need to go faster I'd definitely go with Rawing's answer. – naiveai Jan 15 '18 at 09:41
By the way, could you suggest a title that I could use so this question is more easily searchable? – naiveai Jan 15 '18 at 09:44
@naiveai FTFY, the title. – cs95 Jan 15 '18 at 09:46

score 1 · Answer 2 · answered Jan 15 '18 at 09:11

Regex is overkill here; you can simply append a space to each string before comparing them to get the same result.

You also don't need to sort the data. It's more efficient to simply loop over every value.

def find_longest_matching_option(option, options):
    # append a space so that find_longest_matching_option("a b", ["a b"])
    # works as expected
    option += ' '
    longest = None

    for valid_option in options:
        # append a space to each option so that only complete
        # words are matched
        valid_option += ' '
        if option.startswith(valid_option):
            # remember the longest match
            if longest is None or len(longest) < len(valid_option):
                longest = valid_option

    if longest is not None:
        # remove the trailing space
        longest = longest[:-1]
    return longest
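
The comments on the question raise tabs and runs of multiple spaces, which appending a single space does not cover. One possible way to handle them, following the suggestion to normalize whitespace first, is sketched below. The helper names normalize_whitespace and find_longest_matching_option_ws are assumptions for illustration, not part of either answer.

```python
def normalize_whitespace(s):
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading/trailing whitespace.
    return " ".join(s.split())


def find_longest_matching_option_ws(option, options):
    # Normalize once, then reuse the append-a-space trick from above.
    option = normalize_whitespace(option) + " "
    best, best_len = None, -1
    for valid_option in options:
        candidate = normalize_whitespace(valid_option) + " "
        if option.startswith(candidate) and len(candidate) > best_len:
            # Return the option as it was given, not the normalized copy.
            best, best_len = valid_option, len(candidate)
    return best


print(find_longest_matching_option_ws("foo\t bar  baz", ["foo", "foo bar"]))
# foo bar
```

If the options list is reused across many calls, the normalized options could be computed once up front instead of on every call.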
