
So first off, I'm very new to Python, so if I'm doing something awful I'm prefacing this post with a sorry. I've been assigned this problem:

We want to devise a dynamic programming solution to the following problem: there is a string of characters which might have been a sequence of words with all the spaces removed, and we want to find a way, if any, in which to insert spaces that separate valid English words. For example, theyouthevent could be from “the you the vent”, “the youth event” or “they out he vent”. If the input is theeaglehaslande, then there’s no such way. Your task is to implement a dynamic programming solution in two separate ways:

  • iterative bottom-up version
  • recursive memoized version

Assume that the original sequence of words had no other punctuation (such as periods), no capital letters, and no proper names; all the words will be available in a dictionary file that will be provided to you.

So I'm having two main issues:

  1. I know that this can and should be done in O(N^2), and I don't think mine is.
  2. The lookup table doesn't seem to be adding all the words, so it isn't reducing the time complexity the way it should.

What I'd like:

  1. Any kind of input (a better way to do it, something you see wrong in the code, how I can get the lookup table working, how to use the table of booleans to build a sequence of valid words).
  2. Some ideas on how to tackle the recursive version, although I feel that once I'm able to solve the iterative version I'll be able to engineer the recursive one from it.

As always, thanks for any time and/or effort anyone gives this; it is always appreciated.

Here's my attempt:

# dictW returns True if the word is found in the dictionary, False otherwise
def dictW(s):
    diction = open("diction10k.txt",'r') 
    for x in diction:
        x = x.strip("\n \r")
        if s == x:
            return True
    return False

def iterativeSplit(s):
    n = len(s)
    i = j = k = 0
    A = [-1] * n
    word = [""] * n
    booly = False
    for i in range(0, n):
        for j in range(0, i+1):
            prefix = s[j:i+1]
            for k in range(0, n):

                if word[k] == prefix:
                    #booly = True
                    A[k] = 1
                    #print "Array below at index k %d and word = %s"%(k,word[k])
                    #print A
            # print prefix, A[i]
            if(((A[i] == -1) or (A[i] == 0))):
                if (dictW(prefix)):
                    A[i] = 1
                    word[i] = prefix
                    #print word[i], i
                else:
                    A[i] = 0
    for i in range(0, n):
        print A[i]
  • Removing x = x.strip("\n \r") and replacing if s == x with if s.find(x) != -1 should give a speed-up. Of course, this won't give you exact matches e.g. 'hell' will be found if the word is 'hello'. – confused_at_times Mar 05 '14 at 10:00
  • Interesting. Here are the results for the string lukelucklikeslakes with both versions. Mine: real 0m15.266s, user 0m2.858s, sys 0m0.031s. Yours: real 0m12.906s, user 0m3.264s, sys 0m0.030s. – xe0 Mar 05 '14 at 10:09
  • Yeah. I guess the direct string comparison would always be quicker than trying to find a sequence of strings within another string. I was assuming removing the strip call would be beneficial though. Is there a reason why the strip call is required at all? Could you try your test again with 10x as many entries? Also, you may want to open and store the lines from the file before the for loop starts, as you are now repeating that action i*j*k times. – confused_at_times Mar 05 '14 at 10:29
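A minimal sketch of that last suggestion (assuming diction10k.txt contains one lowercase word per line): load the dictionary into a set once, so each lookup is an O(1) average-time membership test instead of a full file scan.

# Hypothetical helper: read diction10k.txt once and keep the words in a set,
# so dictW becomes a set-membership test instead of re-reading the file.
def load_dictionary(path="diction10k.txt"):
    with open(path) as f:
        return set(line.strip() for line in f)

WORDS = load_dictionary()

def dictW(s):
    return s in WORDS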

2 Answers


For another real-world example of how to do English word segmentation, look at the source of the Python wordsegment module. It's a little more sophisticated because it uses word and phrase frequency tables, but its segment function illustrates the memoization approach:

def segment(text):
    "Return a list of words that is the best segmenation of `text`."

    memo = dict()

    def search(text, prev='<s>'):
        if text == '':
            return 0.0, []

        def candidates():
            for prefix, suffix in divide(text):
                prefix_score = log10(score(prefix, prev))

                pair = (suffix, prefix)
                if pair not in memo:
                    memo[pair] = search(suffix, prefix)
                suffix_score, suffix_words = memo[pair]

                yield (prefix_score + suffix_score, [prefix] + suffix_words)

        return max(candidates())

    result_score, result_words = search(clean(text))

    return result_words

If you replaced the score function so that it returned "1" for a word in your dictionary and "0" if not, then you would simply enumerate all positively scored candidates for your answer.
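For instance, a rough sketch of that replacement (not part of wordsegment; WORDS is an assumed set holding your dictionary, and the near-zero value for unknown words keeps log10 defined while ranking invalid splits far below valid ones):

def score(word, prev='<s>'):
    # 1.0 for dictionary words; a tiny positive value otherwise, so that
    # log10() stays defined and invalid segmentations score far lower.
    return 1.0 if word in WORDS else 1e-100

With this, a segmentation made entirely of dictionary words scores 0.0 (log10 of 1.0 per word), while any split containing a non-word scores hugely negative, so the max() in search still picks a valid segmentation whenever one exists.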

GrantJ

Here is the solution in C++. Read and understand the concept, and then implement it yourself.

This video is very helpful for understanding the DP approach.

One more approach that I feel can help is a trie data structure; it is a better way to solve the above problem.
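For reference, a minimal Python sketch of the bottom-up DP idea (not the linked C++ solution; words is assumed to be a set of valid words, which a trie could replace for large dictionaries):

def can_segment(s, words):
    n = len(s)
    # dp[i] is True when the prefix s[:i] can be split into dictionary words.
    dp = [False] * (n + 1)
    dp[0] = True  # the empty prefix is trivially segmentable
    for i in range(1, n + 1):
        for j in range(i):
            if dp[j] and s[j:i] in words:
                dp[i] = True
                break
    return dp[n]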

divyum