I'm trying to write an algorithm for computing a maximal suffix tree / trie / array for a list of "unordered strings" (i.e. sets of characters). The goal is to represent these sets in memory as compactly as possible. Does such an algorithm already exist?
As an example, given the following input:
[{"A", "B", "C"}, {"A", "C", "D"}, {"C", "A"}, {"C", "A"}, {"A"}, {"A", "D"}, {"A", "E"}]
I would like to output something like the following:
```
     A
   / | \
  C  E  D
 / \
B   D
```
I've found a simple greedy approach of choosing the most common character across sets usually works pretty well, but I'd like to do something optimal if the overhead is reasonable.
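For reference, here is a minimal sketch of that greedy idea in Python (my own illustration, not a known library routine): at each level, pick the character occurring in the most remaining sets, make it a node, strip it from the sets that contain it, and recurse; sets not containing it become siblings, so disjoint inputs naturally yield a forest. Ties are broken arbitrarily.

```python
from collections import Counter

def build_greedy_trie(sets):
    """Greedily build a forest of (char, children) trees from a list
    of character sets. Not guaranteed optimal."""
    sets = [frozenset(s) for s in sets if s]  # drop empty sets
    forest = []
    while sets:
        # Character appearing in the most remaining sets.
        counts = Counter(c for s in sets for c in s)
        top = counts.most_common(1)[0][0]
        # Sets containing `top` go under this node, minus `top` itself.
        covered = [s - {top} for s in sets if top in s]
        sets = [s for s in sets if top not in s]
        forest.append((top, build_greedy_trie(covered)))
    return forest

sets = [{"A", "B", "C"}, {"A", "C", "D"}, {"C", "A"}, {"C", "A"},
        {"A"}, {"A", "D"}, {"A", "E"}]
forest = build_greedy_trie(sets)
```

On the example input this produces a single tree rooted at "A" with children "C", "D", and "E", and "B" and "D" under "C" — matching the tree above.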
Edit: There were some additional requirements I neglected to include:
- It is important that the representation is a DAG.
- Each set element should have an unambiguous parent.
Also, if the input contains disjoint sets, a forest would probably be the most natural output.