I'm implementing a Minimal Acyclic Finite State Automaton (MA-FSA; a specific kind of DAG) in Go, and would like to associate some extra data with the nodes that mark end-of-word (EOW). With an MA-FSA, the traditional approach of attaching the data to the EOW node itself is not possible, because several different words may end at the same node. So I'm looking into minimal perfect hash functions as an alternative.

In the "Correction" box at the top of his blog post, Steve Hanov says that he used the method described in this paper by Lucchesi and Kowaltowski. Figure 12 (page 19) of the paper describes the hashing function.

Line 8 of that figure refers to FirstLetter and Predecessor(), but the paper doesn't seem to say what they are, or I'm not seeing it. What are they?

All I can figure out is that it's just traversing the tree, adding up Number from each node as it goes, but that can't possibly be right. It produces numbers that are too large and it's not one-to-one, like the paper says. Am I misreading something?

Matt

2 Answers

The paper says:

Let us assume that the representation of our automaton includes, for each state, an integer which gives the number of words that would be accepted by the automaton starting from that state.

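So I believe this:

for C <- FirstLetter to Predecessor(Word[I]) do

means (for an alphabet that starts at 'a'):

for (c = 'a'; c < word[i]; c++)

(They're just trying to be alphabet-independent: FirstLetter is the first letter of the alphabet, and Predecessor(x) is the letter immediately before x.)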

Think of it this way: enumerate all accepted words. Sort them. Find your word in the list. Its index is the word's hash value.

Their algorithm avoids storing the complete list by keeping track of how many words are reachable from a given node. So you get to a node, and check all the outgoing edges to other nodes that involve a letter of the alphabet before your next letter. All of the words reachable from those nodes must be on the list before your word, so you can calculate what position your word must occupy in the list.
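To make that concrete, here is a minimal sketch in Go. It is not the paper's Figure 12 verbatim: the `node` type, its field names, and the helpers `insert`, `fillCounts`, and `index` are all made up for illustration, the index is 0-based, and a plain trie stands in for the minimized automaton (minimizing merges equivalent states but does not change any per-state count).

```go
package main

import "fmt"

// node is a hypothetical automaton state; the names are illustrative only.
type node struct {
	final bool           // a word ends at this state
	edges map[byte]*node // outgoing transitions, keyed by letter
	count int            // number of words accepted starting from this state
}

func newNode() *node { return &node{edges: map[byte]*node{}} }

// insert adds word to a plain (unminimized) trie.
func insert(root *node, word string) {
	n := root
	for i := 0; i < len(word); i++ {
		next, ok := n.edges[word[i]]
		if !ok {
			next = newNode()
			n.edges[word[i]] = next
		}
		n = next
	}
	n.final = true
}

// fillCounts stores, in every state, how many words are reachable from it.
// On a real DAG with shared states you would want to memoize this.
func fillCounts(n *node) int {
	n.count = 0
	if n.final {
		n.count = 1
	}
	for _, child := range n.edges {
		n.count += fillCounts(child)
	}
	return n.count
}

// index returns the 0-based position of word in the sorted word list,
// or -1 if the automaton does not accept word.
func index(root *node, word string) int {
	n, idx := root, 0
	for i := 0; i < len(word); i++ {
		if n.final {
			idx++ // the prefix consumed so far is a shorter word that sorts first
		}
		// Equivalent to: for C <- FirstLetter to Predecessor(Word[I])
		for c, child := range n.edges {
			if c < word[i] {
				idx += child.count // every word down this edge sorts first
			}
		}
		next, ok := n.edges[word[i]]
		if !ok {
			return -1
		}
		n = next
	}
	if !n.final {
		return -1
	}
	return idx
}

func main() {
	root := newNode()
	for _, w := range []string{"cat", "car", "cars", "do", "dog", "a"} {
		insert(root, w)
	}
	fillCounts(root)
	for _, w := range []string{"a", "car", "cars", "cat", "do", "dog", "ca"} {
		fmt.Println(w, index(root, w))
	}
}
```

Because only the edges for letters before `word[i]` contribute, the result is exactly the word's position in the sorted list, so it stays below the total word count and no two accepted words share an index.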

Matt
  • Interesting idea; so at every node we iterate over the letters of the alphabet that come lexicographically before the letter we're currently "at" and add their numbers too? I'll mull that over, but at the outset that doesn't make a lot of sense either. – Matt Oct 31 '14 at 18:57
  • "Let us assume that the representation of our automaton includes, for each state, an integer which gives the number of words that would be accepted by the automaton starting from that state." – user4203646 Oct 31 '14 at 19:10
  • Think of it this way: enumerate all accepted words. Sort them. Find your word in the list. Its index is the word's hash value. – user4203646 Oct 31 '14 at 19:11
  • Their algorithm avoids storing the complete list by keeping track of how many words are reachable from a given node. So you get to a node, and check all the outgoing edges to other nodes that involve a letter of the alphabet before your next letter. All of the words reachable from those nodes must be on the list before your word, so you can calculate what position your word must occupy in the list. – user4203646 Oct 31 '14 at 19:13
  • Ahhh, that makes much more sense. Clever! If you don't mind I'm going to inline the gist of your comments into the answer before I accept it. – Matt Oct 31 '14 at 22:36

I have updated my DAWG example to show how to use it as a map from keys to values. Each node stores the number of final nodes reachable from it (including itself). Then, as the trie is traversed during a lookup, we add up the counts of the branches that we skip over. That way, each word in the trie gets a unique number. You can then use that number as an index into an array to get the data associated with the word.

https://gist.github.com/smhanov/94230b422c2100ae4218
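For readers working in Go, a rough sketch of the same idea follows. It is written independently of the gist above, so every name in it (`state`, `wordMap`, `build`, `lookup`) is hypothetical, and it uses an unminimized trie to stay short. The values live in a slice ordered by sorted key, and the count-based index picks the right slot.

```go
package main

import (
	"fmt"
	"sort"
)

// state is a hypothetical trie node; count is the number of final
// states reachable from it, including itself.
type state struct {
	final bool
	next  map[byte]*state
	count int
}

// wordMap pairs the trie with a slice of values in sorted-key order.
type wordMap struct {
	root   *state
	values []string
}

// build inserts the keys in sorted order so that values[i] belongs to
// the i-th key. Counts are maintained along each inserted path.
func build(data map[string]string) *wordMap {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	m := &wordMap{root: &state{next: map[byte]*state{}}}
	for _, k := range keys {
		s := m.root
		s.count++
		for i := 0; i < len(k); i++ {
			child, ok := s.next[k[i]]
			if !ok {
				child = &state{next: map[byte]*state{}}
				s.next[k[i]] = child
			}
			child.count++
			s = child
		}
		s.final = true
		m.values = append(m.values, data[k])
	}
	return m
}

// lookup walks the trie, adding the counts of every branch it skips,
// and uses the resulting number as an index into values.
func (m *wordMap) lookup(key string) (string, bool) {
	s, idx := m.root, 0
	for i := 0; i < len(key); i++ {
		if s.final {
			idx++ // a shorter key ends here and sorts before ours
		}
		for c, child := range s.next {
			if c < key[i] {
				idx += child.count // skipped branch: all its keys sort first
			}
		}
		child, ok := s.next[key[i]]
		if !ok {
			return "", false
		}
		s = child
	}
	if !s.final {
		return "", false
	}
	return m.values[idx], true
}

func main() {
	m := build(map[string]string{
		"car": "vehicle", "cars": "vehicles", "cat": "feline", "dog": "canine",
	})
	fmt.Println(m.lookup("cars"))
	fmt.Println(m.lookup("ca"))
}
```

The per-word data is kept outside the nodes, which is what makes this workable once states are shared between words in a minimized automaton.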

Steve Hanov