
Question - Write a function called answer(document, searchTerms) which returns the shortest snippet of the document, containing all of the given search terms. The search terms can appear in any order.

Inputs:
(string) document = "many google employees can program"
(string list) searchTerms = ["google", "program"]
Output:
(string) "google employees can program"

Inputs:
(string) document = "a b c d a"
(string list) searchTerms = ["a", "c", "d"]
Output:
(string) "c d a"

My program below gives the correct answer, but its time complexity is very high because I am taking the Cartesian product of the term positions. When the input is very large, I cannot pass those test cases. I have not been able to reduce the complexity of this program, and any help will be greatly appreciated. Thanks

import itertools

def answer(document, searchTerms):
    # Smallest total gap found so far (float("inf") stands in for Python 2's sys.maxint)
    minSpan = float("inf")
    matchedString = ""
    stringList = document.split(" ")

    # Map each search term to the word indices where it occurs
    d = dict()
    for j in range(len(searchTerms)):
        for i in range(len(stringList)):
            if searchTerms[j] == stringList[i]:
                d.setdefault(searchTerms[j], []).append(i)

    # Try every combination of one position per term (the expensive part)
    for element in itertools.product(*d.values()):
        sortedList = sorted(list(element))
        differenceList = [t - s for s, t in zip(sortedList, sortedList[1:])]
        if minSpan > sum(differenceList):
            minSpan = sum(differenceList)
            sortedElement = sortedList
            # Terms are adjacent, so no shorter snippet is possible
            if sum(differenceList) == len(sortedElement) - 1:
                break

    try:
        for i in range(sortedElement[0], sortedElement[len(sortedElement)-1]+1):
            matchedString += "".join(stringList[i]) + " "
    except IndexError:
        # No search term occurs in the document
        pass

    return matchedString

If anyone wants to clone my program here is code

python
2 Answers


One solution is to iterate through the document with two indices, start and stop, keeping track of how many times each of the searchTerms occurs between them. While not all terms are present, increase stop until they are (or until you reach the end of the document). Once all terms are present, increase start until one of them drops out of the window. Whenever all searchTerms are present, check whether that candidate is shorter than the best candidate so far. This takes O(N) time in the number of words, provided the number of search terms is bounded or the terms are kept in a set with O(1) lookup. Something like:

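words = document.split()
terms = set(searchTerms)      # O(1) membership tests
counts = dict()               # occurrences of each term inside words[start:stop]
start = 0
stop = 0
cand_start = None
cand_stop = None

while True:
    if len(counts) < len(terms):
        # The window is missing a term: extend it to the right.
        if stop == len(words):
            break
        term = words[stop]
        if term in terms:
            counts[term] = counts.get(term, 0) + 1
        stop += 1
    else:
        # The window holds every term: keep it if shorter, then shrink from the left.
        if cand_start is None or stop - start < cand_stop - cand_start:
            cand_start = start
            cand_stop = stop
        term = words[start]
        if term in counts:
            if counts[term] == 1:
                del counts[term]
            else:
                counts[term] -= 1
        start += 1

# Best window as text, or an empty string if some term never appears
snippet = "" if cand_start is None else " ".join(words[cand_start:cand_stop])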
skyking
  • Thanks for your comment. The main problem is the Cartesian product loop which I need to work on. The dict loop is not taking much time. – python Aug 31 '15 at 05:41
  • Your initial algorithm failed many test cases, let me try this one. – python Aug 31 '15 at 06:43
  • My intention with the code is mostly to convey the general idea. There may be room for improvements or corrections. You should probably not use code that you don't understand anyway... – skyking Aug 31 '15 at 06:50
  • I will try to rewrite your code, and see if I can make any improvements. Highly appreciate your help! Thanks – python Aug 31 '15 at 06:51

The Aho-Corasick algorithm will search a document for multiple search terms in linear time. It works by building a finite state automaton from the search terms, and then running the document through that automaton.

So build the FSA and start the search. As search terms are found, store them in an array of tuples (search term, position). When you've found all of the search terms, stop the search. The last item in your list will contain the last search term found. That gives you the ending position of the range. Then search backwards in that list of found terms until all of the terms are found.

So if you're searching for ["cat", "dog", "boy", "girl"], you might get something like:

cat - 15
boy - 27
cat - 50
girl - 97
boy - 202
dog - 223

So you know the end of the range is 226, and searching backward you find all four terms, with the last one being "cat" at position 50.
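
The bookkeeping described above (record each hit, stop once every term has been seen, then walk the list backwards) can be sketched roughly as follows. This is only an illustration under some assumptions: it finds matches by splitting the document on single spaces rather than with a real Aho-Corasick automaton, and the function name first_covering_range is made up for this example.

```python
def first_covering_range(document, search_terms):
    """Log (term, position) hits left to right, stop when all terms are seen,
    then walk the log backwards to locate the start of the range."""
    terms = set(search_terms)
    found = []        # (term, character position), in document order
    seen = set()
    position = 0

    # Plain word scan; a stand-in (assumption) for the Aho-Corasick search.
    for word in document.split(" "):
        if word in terms:
            found.append((word, position))
            seen.add(word)
            if seen == terms:
                break                    # every term found: stop searching
        position += len(word) + 1        # +1 for the separating space
    else:
        return None                      # at least one term never appeared

    # The last hit gives the end of the range.
    last_term, last_pos = found[-1]
    end = last_pos + len(last_term)

    # Search backwards until every term has been met again.
    needed = set(terms)
    for term, pos in reversed(found):
        needed.discard(term)
        if not needed:
            return document[pos:end]
    return None


print(first_covering_range("many google employees can program",
                           ["google", "program"]))
```

One thing to keep in mind: stopping at the first point where all terms have been found gives the earliest range that covers them, which is not always the shortest. On "a b c d a" with ["a", "c", "d"] this sketch returns "a b c d" rather than "c d a", so to get the shortest snippet the search would have to continue past the first complete match and compare candidates.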

Jim Mischel