data structure for NFA representation

Question

In my lexical analyzer generator I use the McNaughton and Yamada algorithm for NFA construction. One of its properties is that a transition from I to J is marked with the character at position J.

So each node of the NFA can be represented simply as a list of the next possible states.

Which data structure is best suited for storing this type of data? It must provide fast lookup of all possible next states and use little space; insertion time is not so important.

score 3 · Accepted Answer · answered Dec 31 '10 at 18:24

My understanding is that you want to encode a graph, where the nodes are states and the edges are transitions, and where every edge is labelled with a character. Is that correct?

The dull but practical answer is to have an object for each state, and to encode the transitions in some little structure in that object.

The simplest one would be an array, indexed by character code: that's as fast as it gets, but not naturally space-efficient. You can make it more space efficient by using a sort of offset, truncated array: store only the part of the array which contains transitions, along with the start and end indices of that part. When looking up a character in it, check that its code is within the bounds; if it isn't, treat it as a null edge (or an edge back to the start state or whatever), and if it is, fetch the element at index (character code - start). Does that make sense?
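To make the truncated array concrete, here is a minimal sketch in Python. The class name `TruncatedArrayState` and its fields are made up for illustration, and the value stored per character is left opaque: for an NFA it could be a tuple of target state ids rather than a single state.

```python
class TruncatedArrayState:
    """One state whose transitions live in an offset, truncated array.

    Hypothetical sketch: names and layout are assumptions, not from the post.
    """

    def __init__(self, transitions):
        # transitions: dict mapping a single character to a target
        codes = [ord(c) for c in transitions]
        self.start = min(codes) if codes else 0
        self.end = max(codes) if codes else -1  # inclusive upper bound
        # Only the window [start, end] is stored; None marks "no edge".
        self.table = [None] * (self.end - self.start + 1)
        for ch, target in transitions.items():
            self.table[ord(ch) - self.start] = target

    def lookup(self, ch):
        code = ord(ch)
        if code < self.start or code > self.end:
            return None  # outside the stored window: treat as a null edge
        return self.table[code - self.start]
```

A state with edges on 'a' and 'c' then stores three slots instead of one per possible character code, and a lookup costs two comparisons plus one index.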

A more complex option would be a little hashtable, which would be more compact but slightly slower. I would suggest closed hashing (also known as open addressing), because collision lists will use too much memory; linear probing should be enough. You could look into using perfect hashing (look it up), which takes a lot of time to generate the table but then gives collision-free lookup. The generation process is quite complex, though.
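A rough sketch of the closed-hashing variant with linear probing, again in Python. The class name `ProbingTable`, the power-of-two capacity, the 50% load limit and the use of the raw character code as the hash are all assumptions chosen to keep the example short; this is not perfect hashing.

```python
class ProbingTable:
    """Per-state transition table using closed hashing with linear probing."""

    EMPTY = -1  # sentinel: no character code is negative

    def __init__(self, transitions):
        # Power-of-two capacity, kept at most half full so that
        # probe sequences stay short and always hit an empty slot.
        capacity = 4
        while capacity < 2 * len(transitions):
            capacity *= 2
        self.mask = capacity - 1
        self.keys = [self.EMPTY] * capacity
        self.values = [None] * capacity
        for ch, target in transitions.items():
            i = ord(ch) & self.mask
            while self.keys[i] != self.EMPTY:
                i = (i + 1) & self.mask  # linear probe to the next slot
            self.keys[i] = ord(ch)
            self.values[i] = target

    def lookup(self, ch):
        code = ord(ch)
        i = code & self.mask
        while self.keys[i] != self.EMPTY:
            if self.keys[i] == code:
                return self.values[i]
            i = (i + 1) & self.mask
        return None  # reached an empty slot: no edge on this character
```

Since the table is built once and never modified, there is no deletion to worry about, which is the usual awkward part of linear probing.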

A clever approach is to use both arrays and hashtables, and to pick one or the other based on the number of edges: if the compacted array would be more than, say, a third full, use it, but if not, use a hashtable.
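The choice between the two can be sketched as a small function. The name `choose_representation` and the returned strings are hypothetical, and the one-third threshold is just the "say, a third" figure from above, not a tuned value.

```python
def choose_representation(transitions, min_fill=1 / 3):
    """Return "array" or "hash" for one state's transition dict.

    transitions maps single characters to targets. The fill ratio is the
    number of edges divided by the width of the truncated array that
    would be needed to hold them.
    """
    if not transitions:
        return "hash"  # nothing to store; an empty table is the smallest
    codes = [ord(c) for c in transitions]
    span = max(codes) - min(codes) + 1
    fill = len(codes) / span
    return "array" if fill > min_fill else "hash"
```

For example, edges on 'a', 'b' and 'c' fill the array completely and pick the array, while edges on just 'a' and 'z' would leave 24 of 26 slots empty and pick the hashtable.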

Now, something a bit more radical you could do would be to use arrays, but to overlap them - if they're sparse, they'll have lots of holes in, and if you're clever, you can arrange them so that the entries in each array line up with holes in the others. That will give you fast lookups, but also excellent memory efficiency. You will need some scheme for distinguishing when a lookup has found something from when it's found an empty slot with some other state's transition in, but I'm sure you can think of something.
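One way to sketch the overlapping idea is a shared array plus a parallel "owner" array that records which state each slot belongs to, which is one possible answer to the "found someone else's transition" problem. The function names `pack_rows` and `lookup`, the first-fit placement and the densest-first ordering are all assumptions for illustration; a better packing heuristic may well exist.

```python
def pack_rows(rows):
    """Overlap sparse per-state rows into one shared array (first fit).

    rows: list indexed by state id; each entry maps a character to a target.
    Returns (base, owner, target), where slot = base[state] + ord(ch).
    """
    base = [0] * len(rows)
    owner = []   # owner[i] = state id that owns slot i, or -1 if free
    target = []  # target[i] = transition stored in slot i
    # Place the densest rows first, since they are the hardest to fit.
    order = sorted(range(len(rows)), key=lambda s: -len(rows[s]))
    for state in order:
        if not rows[state]:
            continue  # no edges: nothing to place, lookups will miss
        codes = [ord(c) for c in rows[state]]
        lo = min(codes)
        pos = 0
        # Slide the row right until none of its entries hits a used slot.
        while any(pos + c - lo < len(owner) and owner[pos + c - lo] != -1
                  for c in codes):
            pos += 1
        base[state] = pos - lo
        for ch, dest in rows[state].items():
            slot = base[state] + ord(ch)
            if slot >= len(owner):
                grow = slot + 1 - len(owner)
                owner.extend([-1] * grow)
                target.extend([None] * grow)
            owner[slot] = state
            target[slot] = dest
    return base, owner, target


def lookup(base, owner, target, state, ch):
    slot = base[state] + ord(ch)
    # The owner check rejects holes and slots filled by other states.
    if 0 <= slot < len(owner) and owner[slot] == state:
        return target[slot]
    return None
```

With rows `[{'a': 1, 'c': 2}, {'b': 0}, {}]` the second state's single entry lands in the hole between 'a' and 'c', so all three edges fit in three slots. The price is the extra owner array and one more comparison per lookup.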

Yes, it is a form of graph, but with labelled nodes (not edges), and each transition is treated as "marked" with the label of the node it points to. — S.J., Dec 31 '10 at 21:11
Using overlapped arrays looks interesting, I will think about it. Thank you. — S.J., Dec 31 '10 at 21:19
@S.J. Finding a good algorithm to do the overlapping might be a challenge. The only context I remember seeing this done in was generating overlapping vtables for interfaces in an old Java VM, about ten years ago! Might be worth asking another question here about it. — Tom Anderson, Jan 02 '11 at 11:18
