entry="Where in the world is Carmen San Diego"
goal=["Where in the", "world is", "Carmen San Diego"]

I am trying to create a procedure that will search for chunks of words within "entry" that are members of the "goal" list. I would like to preserve word order in these subsets.

This is what I have so far. I'm not really sure how to complete this or if I'm approaching it the right way.

span=1
words = entry.split(" ")
initial_list= [" ".join(words[i:i+span]) for i in range(0, len(words), span)]
x=len(initial_list)
initial_string= " ".join(initial_list)
def backtrack(A,k):
    if A in goal:
        print
    else:
        while A not in goal:
            k=k-1
            A= " ".join(initial_list[0:k])
            if A in goal:
                print A
                words=A.split(" ")
                firstmatch= [" ".join(words[i:i+span]) for i in range(0, len(words), span)]
                newList = []
                for item in initial_list:
                    if item not in firstmatch:
                        newList.append(item)
                nextchunk=" ".join(newList)             

backtrack(initial_string,x)

The output so far is just this:

"Where in the"

Desired Output:

"Where in the"
"world is"
"Carmen San Diego"

I've been spinning my wheels trying to find a proper algorithm for this, and I think it requires either backtracking or search pruning, I'm not really sure. Ideally, a solution would work for any "entry" and "goal" list. Any comments are much appreciated.

  • Your example isn't particularly helpful to understand what you're trying to do. If you have `entry = "abcabcdefdef"` with `goal = ["ab", "dd", "c"]`, what would you expect as output? – Brandon Humpert Oct 23 '14 at 22:14
  • @BrandonHumpert. In that situation I would expect nothing to print. Overall, this is a prototype. This "goal" list actually represents a body of successful JSON queries. "entry" will be a string inputted by the user. I want to break up this user entry into multiple query strings in the "backtrack" fashion I've described. Hope this is clearer. – courtorder52 Oct 24 '14 at 02:59

2 Answers

Here's an idea: put your goal list into a trie. Find the longest matching prefix of your current entry string in the trie, and add it to the output if found.

Then find the next space (the word separator) in your current entry string, set the current entry string to the substring that starts just after that space, and repeat until no space is left.

Edit: here's some code.

import string
import datrie

entry="Where in the world is Carmen San Diego"
goal=["Where in the", "world is", "Carmen San Diego"]

dt = datrie.BaseTrie(string.printable)
for i, s in enumerate(goal):
    dt[s] = i

def find_prefix(current_entry):
    try:
        return dt.longest_prefix(current_entry)
    except KeyError:
        return None

def find_matches(entry):
    current_entry = entry

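    while True:
        # Longest goal phrase that is a prefix of the remaining text, if any.
        match = find_prefix(current_entry)
        if match:
            yield match
        # Advance to the start of the next word; stop when no space is left.
        space_index = current_entry.find(' ')
        if space_index != -1:
            current_entry = current_entry[space_index + 1:]
        else:
            return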

print(list(find_matches(entry)))
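
If datrie isn't available, the same longest-prefix idea can be sketched with a plain nested dict keyed by words. This is only a sketch under a few assumptions: `build_trie`, `longest_match`, `find_chunks`, and the `END` marker are made-up names, matching is on whole space-separated words, and, unlike the loop above (which advances one word at a time), it skips past each match so matches never overlap.

```python
END = None  # marker key: words are strings, so None cannot collide with one


def build_trie(phrases):
    """Build a nested-dict trie keyed by words; END marks a complete phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[END] = True
    return root


def longest_match(trie, words, start):
    """Return how many words the longest phrase starting at words[start] spans (0 if none)."""
    node = trie
    best = 0
    for offset, word in enumerate(words[start:], 1):
        node = node.get(word)
        if node is None:
            break
        if END in node:
            best = offset
    return best


def find_chunks(entry, goal):
    """Collect goal phrases found in entry, left to right, without overlaps."""
    trie = build_trie(goal)
    words = entry.split()
    chunks = []
    i = 0
    while i < len(words):
        n = longest_match(trie, words, i)
        if n:
            chunks.append(" ".join(words[i:i + n]))
            i += n  # jump past the matched phrase
        else:
            i += 1  # no phrase starts here; try the next word
    return chunks


entry = "Where in the world is Carmen San Diego"
goal = ["Where in the", "world is", "Carmen San Diego"]
print(find_chunks(entry, goal))
```

Because the trie is walked word by word, a phrase only matches on word boundaries, so the `"abcabcdefdef"` example from the comments yields nothing.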
w-m

Does this do what you want?

entry="Where in the world is Carmen San Diego"
goal=["Where in the", "world is", "Carmen San Diego"]


for word in goal:
    if word in entry:
        print(word)

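It checks whether each phrase in goal occurs as a substring of entry and prints it if so. Note that the results come out in the order of the goal list, not in the order the phrases appear in entry.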

If you want to collect the matches in a list instead of printing them, you can do something like this:

entry="Where in the world is Carmen San Diego"
goal=["Where in the", "world is", "Carmen San Diego"]
foundwords = []

for word in goal:
    if word in entry:
        foundwords.append(word)
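
Since the question asks to preserve word order, here is a rough variant that sorts the matches by where they start in entry. Treat it as a sketch: `phrases_in_entry_order` is a made-up name, it only looks at the first occurrence of each phrase, and like the loop above it does plain substring matching, so it ignores word boundaries.

```python
def phrases_in_entry_order(entry, goal):
    """Return the goal phrases found in entry, sorted by first position in entry."""
    positions = []
    for phrase in goal:
        index = entry.find(phrase)  # -1 when the phrase is absent
        if index != -1:
            positions.append((index, phrase))
    return [phrase for index, phrase in sorted(positions)]


entry = "Where in the world is Carmen San Diego"
goal = ["Carmen San Diego", "Where in the", "world is"]  # deliberately out of order
print(phrases_in_entry_order(entry, goal))
```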
rwflash
  • thanks for your help. unfortunately "goal" here is just a list I am using to prototype; in actuality it will represent successful queries into an api, so I can't really do a loop through the set of all potentially successful queries – courtorder52 Oct 24 '14 at 03:04
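
If the set of successful queries can't be enumerated, one possible workaround is to treat membership as a yes/no check and shrink the candidate from the right, much like the `backtrack` attempt in the question. The following is only a sketch: `split_into_queries` and the `is_successful_query` callable are hypothetical names (here a set lookup stands in for the real API call), and the strategy is greedy longest-first matching, not full backtracking.

```python
def split_into_queries(entry, is_successful_query):
    """Greedily split entry into the longest word runs accepted by the predicate."""
    words = entry.split()
    found = []
    start = 0
    while start < len(words):
        # Try the longest candidate first, then drop one word at a time.
        for end in range(len(words), start, -1):
            candidate = " ".join(words[start:end])
            if is_successful_query(candidate):
                found.append(candidate)
                start = end  # continue after the accepted chunk
                break
        else:
            start += 1  # nothing starting at this word was accepted


    return found


# Stand-in for the real API check (assumption: it returns True or False).
known_queries = {"Where in the", "world is", "Carmen San Diego"}
entry = "Where in the world is Carmen San Diego"
print(split_into_queries(entry, lambda q: q in known_queries))
```

In the worst case this calls the predicate once for every contiguous word span, which is quadratic in the number of words, so it may be too slow if each check is a network request.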