
This is a little different from most trie problems on stackoverflow (yes, I've spent time searching and reading), so please bear with me.

I have FILE A with words like: allow*, apolog*, etc. There are in total tens of thousands of such entries. And I have FILE B containing a body of text, with up to thousands of words. I want to be able to match words in my text in FILE B with words in FILE A.

Example:

FILE B's "apologize" would match FILE A's "apolog*"

FILE B's "a" would neither match "allow*" nor "apolog*"

FILE B's "apologizetomenoworelseiwillkillyou" would also match FILE A's "apolog*"

Could anyone suggest an algorithm/data structure (preferably doable in Python) that could help me achieve this? The tries I've looked at seem to be more about matching prefixes to whole words, but here I'm matching whole words to prefixes. Stemming algorithms are out of the question because they have fixed rules, whereas in this case my suffix can be anything. I do not want to iterate through the entire list in FILE A, because that would take too much time.

If this is confusing, I'm happy to clarify. Thanks.

K L
  • "I do not want to iterate through my entire list in FILE A" – if you don't iterate through FILE A, how do you know whether a word in FILE B matches? – HVNSweeting Aug 03 '12 at 05:14

4 Answers


Put all of your prefixes in a hashtable. Then take each word in B and look up all prefixes of it in the hashtable. Any hit you get indicates a match.

So the hashtable would contain "allow" and "apolog". For "apologize", you'd look up "a" then "ap", and so on, until you looked up "apolog" and found a match.
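A minimal sketch of this idea (the set contents and the helper name are mine, not from the answer):

```python
# Prefixes from FILE A with the trailing '*' stripped, stored in a set
# (Python's built-in hashtable).
prefixes = {"allow", "apolog"}

def matching_prefixes(word):
    """Return every stored prefix that the word starts with."""
    # Look up "a", "ap", "apo", ... up to the full word;
    # each lookup is an O(1) hash probe.
    return [word[:i] for i in range(1, len(word) + 1) if word[:i] in prefixes]

print(matching_prefixes("apologize"))  # ['apolog']
print(matching_prefixes("a"))          # []
```

Each word of length k costs k hash lookups, so the total work is proportional to the size of FILE B, independent of how many prefixes FILE A holds.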

Keith Randall

If I understand what you're looking for, you want to be able to see all the prefixes from file A that match a given full word from file B. A trie data structure lets you match a single prefix to a list of full words, but you need to go in the other direction.

If so, you might still use a trie to do the matching, using a lookup table to reverse the results. Here's the algorithm:

  • Iterate over the words from File B, putting them into a trie.
  • Iterate over the prefixes from File A, finding the matches from the trie.
    • For each match, add the prefix to a dictionary of lists, indexed by the matched word.

Here's some code implementing the algorithm. You need a trie class named Trie, and iterables passed in as the words and prefixes arguments (use generators if you don't want all the values in memory at the same time):

from collections import defaultdict

def matchPrefixes(words, prefixes):
    """Returns a word-to-prefix lookup table."""

    trie = Trie()
    for word in words:
        trie.add(word)

    lookupTable = defaultdict(list)
    for prefix in prefixes:
        matchedWords = trie.matchPrefix(prefix)

        for word in matchedWords:
            lookupTable[word].append(prefix)

    return lookupTable

This should be pretty efficient in both time and memory, especially when the list of words is much shorter than the list of prefixes. Prefixes that don't match any words will not use any memory after they've been checked against the trie.
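The answer assumes an existing Trie class; a minimal nested-dict implementation of just the two methods the code above uses (the names add and matchPrefix follow the snippet, everything else is my own sketch) could look like this:

```python
class Trie:
    """Minimal nested-dict trie supporting add() and matchPrefix()."""

    _END = object()  # sentinel key marking the end of a stored word

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[Trie._END] = word  # store the full word at its final node

    def matchPrefix(self, prefix):
        """Yield every stored word that starts with the given prefix."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return  # no stored word has this prefix
            node = node[ch]
        # Depth-first walk of the subtree collects all completions.
        stack = [node]
        while stack:
            for key, child in stack.pop().items():
                if key is Trie._END:
                    yield child
                else:
                    stack.append(child)

trie = Trie()
for w in ["apologize", "apology", "allowance"]:
    trie.add(w)
print(sorted(trie.matchPrefix("apolog")))  # ['apologize', 'apology']
```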

Blckknght

In the case that the number of words in FILE B is much larger than prefixes in FILE A, you can also build a Trie of the prefixes and match the words in it.

It will be much easier if you understand the way a Trie works. A Trie is a tree built from strings, as shown below. Matching a string in a Trie is a process of walking from the root to one of the leaves.

In your problem, if we put the prefixes in the Trie, and look for the words, we will need to mark some of the internal nodes in the Trie as the terminations of prefixes. When we look for a word in the Trie, every time we reach a node that is marked as the termination of a prefix, we add that prefix as "matched" to the word. (then we go on to the next letter).

This is exactly @Blckknght's solution reversed. Which of the two is more efficient depends on which of FILE A and FILE B is larger.

In @Blckknght's solution, each node in the Trie is marked with the set of words whose paths contain that node. The search for a prefix ends at the prefix's last letter. When it stops, we take the Trie node the search stopped at and add the set marked on that node as the words matched by the prefix.

I'll write some pseudo-code if it's helpful to anyone.
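A small Python sketch of the walk described above (the nested-dict representation and the sentinel key marking a prefix's termination node are my own choices, not from the answer):

```python
PREFIX_END = "$"  # sentinel key: this node is the termination of a prefix

def build_prefix_trie(prefixes):
    """Build a trie from the FILE A prefixes ('*' already stripped)."""
    root = {}
    for prefix in prefixes:
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
        node[PREFIX_END] = prefix  # mark the internal node as a termination
    return root

def matched_prefixes(trie, word):
    """Walk the word through the trie, collecting every prefix
    whose termination node we pass through."""
    matches = []
    node = trie
    for ch in word:
        if PREFIX_END in node:
            matches.append(node[PREFIX_END])
        if ch not in node:
            break  # the word left the trie; no further prefix can match
        node = node[ch]
    else:
        if PREFIX_END in node:  # the whole word equals a stored prefix
            matches.append(node[PREFIX_END])
    return matches

trie = build_prefix_trie(["allow", "apolog"])
print(matched_prefixes(trie, "apologize"))  # ['apolog']
print(matched_prefixes(trie, "a"))          # []
```

Each lookup costs at most len(word) steps, regardless of how many prefixes are stored.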

(Trie diagram from the Wikipedia article on tries, which also includes sample code in its "Algorithms" section.)

lavin
  • It is a good point that tries *can* be used to directly match a full word to several prefixes. That's not a very common operation for them though, so if you're using a trie implementation from a library, it may not be available. – Blckknght Aug 03 '12 at 08:56

Let's assume you have 100,000 words in each file.

Sorting takes O(n log n), and each binary-search lookup takes O(log n), so the worst case is O(n log n), which is 100,000 * log(100,000) ≈ 100,000 * 11 ≈ 10^6 operations: almost instant.

I don't think you need anything fancy with files that small. Simply sort and binary search:

__author__ = 'Robert'

from bisect import bisect_right

file_a = """hell*
wor*
howard*
are*
yo*
all*
to*""".splitlines()

file_b = """hello world how are you all today too told town""".split()

a_starts = sorted(word[:-1] for word in file_a)  # strip '*'; easy even with 100,000 words

def match(word):
    pos = bisect_right(a_starts, word)
    # A matching prefix must sort at or before the word, so scan backwards
    # from the insertion point. A shorter match can hide behind a longer
    # non-match (e.g. "a" behind "apple" when looking up "apricot"), so
    # keep scanning while the first letters still agree.
    for i in range(pos - 1, -1, -1):
        candidate = a_starts[i]
        if word.startswith(candidate):
            return candidate
        if not candidate or candidate[0] != word[0]:
            break
    return None

for word in file_b:
    print(word, " -> ", match(word))

"""
hello  ->  hell
world  ->  wor
how  ->  None
are  ->  are
you  ->  yo
all  ->  all
today  ->  to
too  ->  to
told  ->  to
town  ->  to
"""
Rusty Rob