I have a set (not a list) of strings (words). It is a big one. (It's extracted from images with OpenCV and Tesseract, so there's no reliable way to predict its contents.)

At some point while working with this set, I need to find out whether it contains at least one word that begins with the part I'm currently processing. So it's something like this (NOT actual code):

if exists(word.startswith(word_part) in word_set) then continue else break

There is a very good answer here on how to find all strings in a list that start with something:

result = [s for s in string_list if s.startswith(lookup)]

or

result = filter(lambda s: s.startswith(lookup), string_list)

But they return a list or an iterator of all the strings found. I only need to find out whether any such string exists in the set, not get them all. Performance-wise, it seems wasteful to build a list, check whether its length is greater than zero, and then throw the list away.

Is there a better / faster / cleaner way?

Paul Alex
  • you could just `re.search(r"\blookup_term", original_block_of_text)` ... but it still is O(N) ... you could make a set of the `word[:len(lookup_term)]` instead of the whole word ... but still takes O(N) to build that set (but then very fast lookup)... – Joran Beasley Dec 18 '19 at 04:40
  • If you want short-circuit, why not `if any(word.startswith(word_part) for word in wordset)`? – Chris Dec 18 '19 at 04:41
  • If you want to make this check more performant, you should store your strings in a prefix trie (see this question https://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python or use a real library), otherwise just use `any` as others have already mentioned. – Boris Verkhovskiy Dec 18 '19 at 05:00
  • `sort` + `bisect` + `startswith`. – Stefan Pochmann Dec 18 '19 at 05:20
  • Joran Beasley, no, that won't do: I need to do that multiple times, with different-*length* starting letter sets, so building a different list for each lookup is too performance-heavy. Chris, yes, that's exactly what I need. I'm just beginning coding in Python so I didn't know about "any" yet, thank you! Boris, that's too data-science for a former web developer like me. Not my level yet, but thanks for the pointer. I'll look in that direction later, when I have more experience. – Paul Alex Dec 18 '19 at 05:29
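
The `sort` + `bisect` + `startswith` suggestion from the comments could look roughly like the sketch below. The helper name `has_prefix` and the sample words are made up for illustration. The idea is that in a sorted list every word sharing a prefix sits in one contiguous run, so the first word that is not smaller than the prefix is the only one that needs checking. The one-time sort works for prefixes of any length, which seems to fit the repeated-lookup case described above.

```python
from bisect import bisect_left

def has_prefix(sorted_words, prefix):
    """Return True if any word in the sorted list starts with prefix."""
    # Index of the first word that is >= prefix in lexicographic order.
    i = bisect_left(sorted_words, prefix)
    # If any word starts with prefix, the word at that index does.
    return i < len(sorted_words) and sorted_words[i].startswith(prefix)

# Hypothetical sample data standing in for the OCR word set.
word_set = {"tesseract", "opencv", "image", "text"}
sorted_words = sorted(word_set)  # one-time O(n log n) cost

print(has_prefix(sorted_words, "te"))  # True
print(has_prefix(sorted_words, "op"))  # True
print(has_prefix(sorted_words, "x"))   # False
```

Each lookup is then O(log n) comparisons instead of a scan over the whole set, at the price of keeping a sorted copy of the words.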

2 Answers

Your pseudocode is very close to real code!

if any(word.startswith(word_part) for word in word_set):
    continue
else:
    break

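any returns True as soon as it finds one truthy element, and the generator expression produces its values lazily, so the remaining words are never checked. In the worst case (no match) it still scans the whole set.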
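
To see the short-circuiting, here is a small sketch. The `make_words` generator and its sample words are invented for the demonstration; it records every word that actually gets consumed.

```python
def make_words(consumed):
    """Yield a few sample words, recording each one that is actually pulled."""
    for word in ["apple", "banana", "band", "cherry"]:
        consumed.append(word)
        yield word

consumed = []
found = any(word.startswith("ban") for word in make_words(consumed))

print(found)     # True
print(consumed)  # ['apple', 'banana'] -- iteration stopped at the first match
```

Note that this only works with a generator expression. Writing `any([...])` with a list comprehension would build the full list first and lose the early exit.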

wjandrea
  • exactly what I need, thank you! I'm just starting learning python (about 10 days ago wrote my very first script that's more than "print('this is a drill')"-lookalikes) so some things I just didn't encounter yet. – Paul Alex Dec 18 '19 at 05:30

You need yield:

def find_word(word_set, letter):
    for word in word_set:
        if word.startswith(letter):
            yield word
    yield None  # sentinel, so next() never raises StopIteration

if next(find_word(word_set, letter)):
    print('word exists')

yield produces the words lazily, so calling next() once runs the loop only until the first match (or until the final yield None if there is none).
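
A shorter variant of the same idea, offered as a sketch rather than as part of the original answer: `next()` accepts a default value, so a generator expression can replace the explicit function and the trailing `yield None`. The name `first_match` and the sample words are made up here.

```python
def first_match(words, prefix):
    """Return the first word starting with prefix, or None if there is none."""
    return next((w for w in words if w.startswith(prefix)), None)

# Hypothetical sample data.
word_set = {"tesseract", "opencv", "image"}

print(first_match(word_set, "im") is not None)  # True
print(first_match(word_set, "zz") is not None)  # False
```

Comparing against `None` instead of relying on truthiness also handles the edge case where the matching word is an empty string.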

Sayandip Dutta
  • i dont think you need the cast to iter ... generators automatically are iterators `next(find_word(word_set,letter))` – Joran Beasley Dec 18 '19 at 04:48
  • is this faster / lighter than above example? it surely is more code :) but, definitely _way_ more human-readable! – Paul Alex Dec 18 '19 at 05:31
  • Timed it, and this one is pretty faster as compared to the other one. But now I am not exactly sure why. It is almost like the implementation of any, which is used in the other answer. May be @JoranBeasley can elaborate? – Sayandip Dutta Dec 18 '19 at 05:52
  • based on dis.dis it has to maybe build an extra function ... but honestly im a bit surprised that its faster to do this than any ... not blown away but surprised ... but i think its pretty close to the same between both methods ... – Joran Beasley Dec 18 '19 at 06:51
  • I would not be blown away as well, until I found [this massive difference](https://stackoverflow.com/questions/59386919/why-is-any-slower-than-yield-for-checking-if-any-element-exists-matching-conditi) – Sayandip Dutta Dec 18 '19 at 07:00
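
For anyone who wants to repeat the timing discussed in these comments, a minimal `timeit` sketch might look like this. The word data, the prefixes, and the wrapper names `exists_any` and `exists_yield` are made up; results will depend on the Python version and the data, so no numbers are claimed here.

```python
import timeit

def exists_any(words, prefix):
    """Existence check using any() with a generator expression."""
    return any(w.startswith(prefix) for w in words)

def find_word(words, prefix):
    """Generator from the answer above: yields matches, then a None sentinel."""
    for w in words:
        if w.startswith(prefix):
            yield w
    yield None

def exists_yield(words, prefix):
    """Existence check using the first value of the generator."""
    return next(find_word(words, prefix)) is not None

# Made-up data: 10,000 distinct words.
words = {f"word{i}" for i in range(10_000)}

for prefix in ("word9999", "nomatch"):
    # Both approaches must agree before their timings are worth comparing.
    assert exists_any(words, prefix) == exists_yield(words, prefix)
    t_any = timeit.timeit(lambda: exists_any(words, prefix), number=20)
    t_yield = timeit.timeit(lambda: exists_yield(words, prefix), number=20)
    print(f"{prefix!r}: any={t_any:.4f}s  yield={t_yield:.4f}s")
```

Timing both a prefix that matches and one that does not helps separate the early-exit case from the full-scan case.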