String parsing using Python?

Question

Given a string such as 'helloyellowellow', parse all the valid strings from the given string. (Eg: [[hell,hello,yellow],[low, low]........]

I am looking for the most optimized way to write the code. Here is mine but I am not sure if this is the best way.

Full disclosure - This was an interview question

master = []

#   Dictionary for us to look up words   
def is_word(inputstr):
    #returns True/False


def processstring(fstr,secstr,li):
    if is_word(fstr): 
        li.append(fstr)
    if len(secstr) == 0:
        if len(li) != 0:
            master.append(li)
        return
    processstring(fstr+secstr[0], secstr[1:len(secstr)],li)



def wrapperprocess(inpstr):
    li = []
    if len(inpstr) == 0:
        return
    processstring('',inpstr,li)
    wrapperprocess(inpstr[1:len(inpstr)])


wrapperprocess('helloyellowellow')
print master

In your solution, looks like you forgot to `return li`. A better approach is to `yield` the matched words, instead of maintaining a list, appending to it, and returning it. — shx2, Oct 24 '13 at 18:27

shx2 · Answer 1 · 2013-10-24T18:20:31.557

Since you mentioned you're looking for an efficient algorithm, and assuming you get the dictionary in advance (and not just as a callable predicate), you can use the Aho–Corasick algorithm.

Of course, if the input text is short, a more naive algorithm will be faster, for avoiding the "expensive" pre-processing of the dictionary.

Plus, an alternative python-answer: here's a simple way to simply check each substring:

def gen_words(txt):
    n = len(txt)
    for i in range(n):
        for j in range(i+1, n+1):
            subtxt = txt[i:j]
            if is_word(subtxt):
                yield subtxt

For uniqueness, do:

all_words = set(gen_words(txt))

dawg · Answer 2 · 2013-10-24T20:24:38.897

2

You could do something like:

tgt='helloyellowellow'

with open('/usr/share/dict/words') as f:
    for word in f:
        word=word.strip()
        if word in tgt and len(word)>1:
            print word

Prints:

el
ell
he
hell
hello
lo
low
loy
ow
owe
we
well
ye
yell
yellow

If you are just looking for the function is_word that you have undefined, you can play with something like this:

def is_word(word, dic='/usr/share/dict/words'):
    if not hasattr(is_word, 'words'):
        with open(dic) as f:
            is_word.words={word.strip() for word in f}

    return word in is_word.words and len(word)>1

As a default data structure, Python sets have an average look-up time of O(1). You are very unlikely to write something on your own that is faster.

edited Oct 24 '13 at 20:24

answered Oct 24 '13 at 18:08

dawg

98,345
23
131
206

Thanks for the code. But, how is it efficient if you are looking up every single word from the dictionary to match with your string? Won't you be doing millions of matches when only a small subset of them match? – user2917012 Oct 24 '13 at 18:27
2

What is 'efficient' in this case? On my (old, slow) computer, this executes in 88 ms. Just printing 'hello' in Python takes 22 ms, so at 60ms more it is pretty fast. Only one word at a time is in memory, so it is pretty memory efficient. Since it took me about 30 seconds to write, it is pretty programmer efficient. In what way do you wish to be more efficient? ;-) – dawg Oct 24 '13 at 18:33

score 0 · Answer 3 · answered Oct 24 '13 at 18:12

It's good problem to solve with,

Use Wordnet package,

While parsing your given string start with some index and keep tormenting your index value for every incremental on the index, check the existence of the same word using wordnet, it will tell you weather the particular sub-string is a meaningful or not !

To install wordnet:

https://pypi.python.org/pypi/Wordnet-bn/1.0

String parsing using Python?

3 Answers3