
I'm trying to write an algorithm for computing a maximal suffix tree / trie / array for a list of "unordered strings" (i.e. sets of characters). The goal is to minimally represent these sets in memory. Is there such an algorithm already?

As an example, given the following input:

[{"A", "B", C"}, {"A", "C", "D"}, {"C", "A"}, {"C", "A"}, {"A"}, {"A", "D"}, {"A", "E"}]

I would like to output something like the following:

                      A
                    / | \
                   C  E  D
                  / \
                 B   D

I've found that a simple greedy approach, repeatedly choosing the most common character across the remaining sets, usually works pretty well (sketched below), but I'd like to do something optimal if the overhead is reasonable.
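For concreteness, here is a minimal Python sketch of that greedy idea; the function name `build_forest` and the nested-tuple tree representation are my own assumptions, not an established algorithm. It picks the element occurring in the most sets, makes it a root, strips it from the sets containing it, and recurses; sets that don't contain it start new trees, so the output is naturally a forest:

    from collections import Counter

    def build_forest(sets):
        """Greedy sketch: most-common element becomes a root; recurse on
        the remainders of the sets that contained it."""
        # Drop empty sets and normalize to frozensets.
        sets = [frozenset(s) for s in sets if s]
        forest = []
        while sets:
            # Element occurring in the largest number of remaining sets.
            root, _ = Counter(e for s in sets for e in s).most_common(1)[0]
            # Sets containing the root descend from it (root removed);
            # the rest start their own tree on a later loop iteration.
            children = [s - {root} for s in sets if root in s]
            sets = [s for s in sets if root not in s]
            forest.append((root, build_forest(children)))
        return forest

    data = [{"A", "B", "C"}, {"A", "C", "D"}, {"C", "A"},
            {"C", "A"}, {"A"}, {"A", "D"}, {"A", "E"}]
    print(build_forest(data))
    # e.g. [('A', [('C', [('B', []), ('D', [])]), ('D', []), ('E', [])])]

On the example input this yields a single tree rooted at A with children C, D, E, and with B and D under C, matching the diagram above. Note that ties (e.g. between B and D under C) are broken arbitrarily, which is one way this greedy approach can miss the optimum.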

Edit: There were some additional requirements I neglected to include:

  1. It is important that the representation is a DAG.
  2. Each set element should have an unambiguous parent.

Also, if there are disjoint sets, a forest as an output would probably be the most natural representation.

  • Please edit to add the other requirements you mentioned in a comment on a now-deleted answer. Separately, what if there is a pair of disjoint character sets (e.g., `{"F", "G"}` was added)? Do you then have multiple trees? – j_random_hacker May 11 '21 at 05:45
  • I'll add the edit. I'll also specify that behavior too. – Alexander Brassel May 12 '21 at 01:37

0 Answers