
My problem is basically this: I have a set of strings, and I want to construct a DAG such that every path from the source corresponds to a string and vice versa. However, I am free to permute the characters within each string arbitrarily, because the order of characters does not matter for my purposes. Each DAG I generate has a cost: roughly, the cost of a branch in the DAG is proportional to the length of its child paths.

For example, let's say I have the strings BAAA, CAAA, DAAA, and I construct a DAG representing them without permuting them. I get:

() -> (B, C, D) -> A -> A -> A

where the tuple represents branching.

A cheaper representation for my purposes would be:

() -> A -> A -> A -> (B, C, D)

The problem is: given n strings, permute the characters of each string so that the corresponding DAG has the lowest cost. The cost is the total number of nodes visited, counted with multiplicity, when the graph is traversed from the source in depth-first, left-to-right order.

So the cost of the first example is 12 (not counting the source), because we must visit the A's once for each of the three branches. The cost of the second example is 6, because we visit the A's only once before we reach the branches.
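The two costs above can be reproduced with a small sketch. It assumes that the traversal cost equals the number of non-root nodes in the trie obtained by unfolding the DAG, so that a shared prefix is counted once and everything after a branch is counted once per branch. The function name `trie_cost` is made up for illustration.

```python
def trie_cost(words):
    """Count trie nodes below the root for the words in their given order."""
    root = {}
    count = 0
    for word in words:
        node = root
        for ch in word:
            if ch not in node:
                # A new node is created only where the prefix is not shared yet.
                node[ch] = {}
                count += 1
            node = node[ch]
    return count


print(trie_cost(["BAAA", "CAAA", "DAAA"]))  # 12
print(trie_cost(["AAAB", "AAAC", "AAAD"]))  # 6
```

Under that assumption, the question becomes: which permutation of each string makes this count as small as possible?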

I have a feeling this problem is NP-hard. It seems like a question about formal languages, and I'm not familiar enough with those sorts of algorithms to figure out how to go about the reduction. I don't need a complete answer per se, but if someone could point out a class of well-known problems that seem related, I would much appreciate it.

Charles
danharaj
  • The cost function does not seem very clear. Perhaps you should elaborate? That said, if you don't get an answer here, you could consider cstheory.stackexchange.com. But please do so only after it is given due consideration here. –  May 14 '11 at 21:06
  • For each character in the alphabet, count its occurrences and map it to the words that contain it. Take the character with the most occurrences and make a node out of it, remove that character once from each associated word, and recurse within that group. – sleeplessnerd May 14 '11 at 21:16
  • Re Aryabhatta: sorry that it is not clear. What I am doing is walking down each path in the DAG. When I am done walking down a path, I move back up to where the last branch on that path occurred and walk down the next path from there. The idea is that if I have already walked down the prefix of a path, I just have to visit the rest of it. I care about walking down each path in this manner, so I may visit a node multiple times if it lies on multiple paths. Does this make it clear? – danharaj May 14 '11 at 21:20
  • If it is minimal, it's not a DAG anymore but a tree, I think. Please prove me wrong. – sleeplessnerd May 14 '11 at 21:22
  • Re sleeplessnerd: There are some instances where a DAG that is not a tree minimizes the cost, but I think in general there exists a tree that minimizes the cost. For example, a graph that minimizes BA, BE, and BBE can be a DAG that is not a tree. Incidentally, I noticed a flaw in my problem description, in that if I want your algorithm to work properly, I need to end each string with a terminal character. Other than that, I think your algorithm is correct and I feel silly for missing it. That's what happens when you decide a problem is hard at 3 AM without checking yourself in the morning! – danharaj May 14 '11 at 21:39
  • Can you permute the strings independently? In your example, you changed position 4 to position 1 for all strings. Given ABC, DEF, can you treat ABC as BAC while fixing DEF? – Rob Neuhaus May 14 '11 at 21:40
  • Re rrenaud: Yes. In my previous comment I stated that I should add a terminal character to each string. If I do that, the only constraint is that I need to keep the terminal character fixed at the end. – danharaj May 14 '11 at 21:43
  • Reality check: this is a DAG with loops, correct? – ThomasMcLeod May 14 '11 at 22:19
  • Are you using a finite alphabet? – ThomasMcLeod May 14 '11 at 22:22
  • Re ThomasMcLeod: It has diamonds, if that's what you mean. And no, the alphabet isn't finite. Also after reading autismal's answer below, I now see that my 3 AM self wasn't so silly. I managed to construct an example where sleeplessnerd's algorithm does not work. – danharaj May 14 '11 at 22:26
  • By loop, I mean an edge from a vertex to itself. If you're not using loops, then what does "() -> A -> A -> A -> (B, C, D)" in your question mean? Also, what was the counterexample? On Stack Exchange sites, if you use @ in your post, the user will get a notification. – ThomasMcLeod May 14 '11 at 22:54
  • @ThomasMcLeod, thanks for the tip about @. I'm new here. I made a drawing because drawing trees in text is cumbersome: [link](http://img215.imageshack.us/img215/9942/graphex.jpg) As for the counterexample: consider the strings AB, AC, AD, BE, CF, DG, representing the edges of a graph on 7 vertices. The greedy algorithm does not work here. – danharaj May 14 '11 at 23:05
  • @danharaj: Good point! But assuming the problem is NP-hard, the algorithm seems to be a reasonably good heuristic in O(n^3)-ish. (It is close to 3 AM here right now, so don't rely on that ;) – sleeplessnerd May 15 '11 at 00:38
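
The greedy heuristic from the comments, and the counterexample to it, can be sketched as follows. This is one reading of the comment, not a reference implementation: the names `greedy_cost` and `trie_cost` are made up, ties between equally frequent letters are broken alphabetically (an arbitrary choice), the terminal character discussed above is left out, and cost is taken to be the number of non-root trie nodes.

```python
from collections import Counter


def greedy_cost(words):
    """Non-root trie nodes produced by the most-frequent-letter heuristic."""
    words = [w for w in words if w]  # exhausted words need no further nodes
    cost = 0
    while words:
        # Count in how many of the remaining words each letter occurs.
        counts = Counter(ch for w in words for ch in set(w))
        # Most frequent letter first; ties broken alphabetically (assumption).
        best = min(counts, key=lambda ch: (-counts[ch], ch))
        # Words containing that letter share one node; remove it once from each.
        group = [w.replace(best, "", 1) for w in words if best in w]
        words = [w for w in words if best not in w]
        cost += 1 + greedy_cost(group)
    return cost


def trie_cost(words):
    """Non-root trie nodes for words taken in their given character order."""
    prefixes = {w[:i] for w in words for i in range(1, len(w) + 1)}
    return len(prefixes)


edges = ["AB", "AC", "AD", "BE", "CF", "DG"]
print(greedy_cost(edges))                                # 10
print(trie_cost(["BA", "CA", "DA", "BE", "CF", "DG"]))   # 9
```

Greedy picks A first because it occurs three times, but it then still needs B, C, and D at depth one. Putting B, C, and D first makes A unnecessary at depth one and saves a node.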

1 Answer


To rephrase:

Given words w1, …, wn, compute permutations x1 of w1, …, xn of wn to minimize the size of the trie storing x1, …, xn.

Assuming an alphabet of unlimited size, this problem is NP-hard via a reduction from vertex cover. (I believe it might be fixed-parameter tractable in the size of the alphabet.) The reduction is easy: given a graph, let each vertex be its own letter and create a two-letter word for each edge.

There is exactly one node at depth zero (the root), and as many nodes at depth two as there are edges. The possible sets of nodes at depth one are exactly the sets of vertices that are vertex covers, so minimizing the size of the trie amounts to finding a minimum vertex cover.
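
As a sanity check on small inputs, the correspondence can be verified by brute force. The sketch below assumes a simple graph (no repeated edges, no self-loops) given as a list of vertex pairs. The helper names `min_vertex_cover` and `min_trie_size` are made up for illustration, and both run in exponential time. The example graph is the one with edges AB, AC, AD, BE, CF, DG from the comments on the question.

```python
from itertools import combinations, product


def min_vertex_cover(edges):
    """Size of a smallest vertex cover, found by trying subsets in size order."""
    vertices = sorted({v for e in edges for v in e})
    for k in range(len(vertices) + 1):
        for cover in combinations(vertices, k):
            if all(u in cover or v in cover for u, v in edges):
                return k


def min_trie_size(edges):
    """Smallest trie (root included) over both orderings of each two-letter word."""
    best = None
    for flips in product([False, True], repeat=len(edges)):
        words = {(v, u) if f else (u, v) for (u, v), f in zip(edges, flips)}
        first_letters = {w[0] for w in words}  # nodes at depth one
        size = 1 + len(first_letters) + len(words)  # root + depth one + depth two
        best = size if best is None else min(best, size)
    return best


edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "E"), ("C", "F"), ("D", "G")]
print(min_vertex_cover(edges))  # 3
print(min_trie_size(edges))     # 10
```

For this graph the minimum trie size is 1 + 3 + 6, that is, the root, a minimum vertex cover {B, C, D}, and one node per edge.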

qrqwe
  • I believe you. Your reduction makes sense, and I was able to construct a counterexample to the algorithm suggested in the question comments. I guess I was thinking of a compressed version of a trie where nodes can have multiple incoming edges? – danharaj May 14 '11 at 22:30