Finding the longest path in the Sling Blade Runner puzzle

Question

I've been trying to solve an archived ITA Software puzzle known as Sling Blade Runner for a few days now. The gist of the puzzle is as follows:

"How long a chain of overlapping movie titles, like Sling Blade Runner, can you find?"

Use the following listing of movie titles: MOVIES.TXT. Multi-word overlaps, as in "License to Kill a Mockingbird," are allowed. The same title may not be used more than once in a solution. Heuristic solutions that may not always produce the greatest number of titles will be accepted: seek a reasonable tradeoff of efficiency and optimality.

The file MOVIES.TXT contains 6561 movie titles in alphabetical order.

My attempt at a solution has several parts.

Graph Construction:

I mapped every movie title to every other movie title it can chain to (on its right). The resulting graph is a Map[String, List[String]]. You can see the graph that was built using this process here.
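
As a rough illustration of this construction step, here is a minimal sketch in Python (not the asker's Scala code). It assumes that overlaps are whole words, that matching is case-insensitive, and that a title never chains to itself; `build_graph` and the sample titles are hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Dict, List


def build_graph(titles: List[str]) -> Dict[str, List[str]]:
    """Map each title to the titles that can follow it in a chain.

    Assumption: title A chains to title B when the last k words of A
    equal the first k words of B for some k >= 1 (case-insensitive).
    """
    words = {t: tuple(t.lower().split()) for t in titles}

    # Index every word-prefix of every title so suffix lookups are cheap.
    by_prefix = defaultdict(list)
    for title, w in words.items():
        for k in range(1, len(w) + 1):
            by_prefix[w[:k]].append(title)

    graph: Dict[str, List[str]] = {}
    for title, w in words.items():
        successors: List[str] = []
        for k in range(1, len(w) + 1):
            # Titles whose first k words match this title's last k words.
            for other in by_prefix.get(w[-k:], []):
                if other != title and other not in successors:
                    successors.append(other)
        graph[title] = successors
    return graph


sample = ["Sling Blade", "Blade Runner", "License to Kill", "To Kill a Mockingbird"]
graph = build_graph(sample)
```

With the four sample titles, "Sling Blade" maps to "Blade Runner" and "License to Kill" maps to "To Kill a Mockingbird", while the other two titles have no successors.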

Graph Traversal:

I did a Depth First Search using every node (every key in the map) as a starting node of the search. I kept track of the depth at which each node was visited during the search, and this was tagged in the nodes returned by the DFS. What I ended up with was a List[List[Node]] where every List[Node] in the List was the DFS tree from a particular search.

Finding the longest chain:

I took the results of all the graph traversals in the previous step, and for every List[Node] I sorted the list by the depth values I tagged the nodes with previously, in descending order. Then starting with the head of the list (which gives me the deepest node visited in the DFS) I backtrack through the nodes to build a chain. This gave me a List[List[String]] where every List[String] in the List was the longest chain for that particular DFS. Sorting the List[List[String]] by the size of each List[String] and grabbing the head then gave me the largest chain.

Results:

The longest chain found with my algorithm was 217 titles long. The output can be viewed here.

I've only been able to find a few other attempts by Googling, and it seems every other attempt has produced longer chains than mine. For example, this post states that Eric Burke found a chain 245 titles long, and a user by the name of icefox on Reddit found a chain that was 312 titles long.

I can't think of where my algorithm is failing to find the longest chain, given other people have found longer chains. Any help/guidance is much appreciated. If you'd like to review my code, it can be found here (it's written in Scala and I just started learning Scala so forgive me if I made some noob mistakes).

Update:

I made some changes to my algorithm, which now finds chains of length 240+. See here

This looks like the right approach to me. The next step would be to look at one of the longer chains that someone else has found, and step through your code to work out why that chain isn't being picked up. — chiastic-security, Oct 05 '14 at 23:53
@chiastic-security I've spent the last several hours doing just that, and can't for the life of me find a problem with my algorithm. From what I can tell, it's exhaustive. — Christopher Perry, Oct 06 '14 at 06:06
I think your DFS search may be the issue (I've not looked at your code yet so this may be wrong). You're finding one path from each node to every other node, but there may be several ways of getting there, probably of different lengths. — The Archetypal Paul, Oct 06 '14 at 10:49
The DFS on each node finds every path from that node. I pull the longest path from each tree spit out by the DFS. It must be something else. — Christopher Perry, Oct 06 '14 at 22:36
Does it? How do you handle loops? If you're just stopping searches because you've seen a node in the search already, then I think you are not getting all paths from your start node to some target nodes. "It must be something else" is flying in the face of the evidence a bit. Your DFS is NOT finding the longest path, if others can find longer ones. — The Archetypal Paul, Oct 07 '14 at 09:11
And David's answer puts this better than me. Your DFS approach finds some chain from a node to some other node. It doesn't necessarily find the longest such path. — The Archetypal Paul, Oct 07 '14 at 09:15
The solution you linked on reddit isn't correct. It has a chain TROUBLE EVERY DAY..EVERYDAY PEOPLE — xhassassin, Oct 22 '14 at 07:35
@xhassassin Nice find. That makes me feel a lot better about my solution. — Christopher Perry, Oct 22 '14 at 17:06

David Eisenstat · Accepted Answer · 2014-10-15T03:23:35.800

The issue is that, since the movie graph (I'm assuming) has cycles, no matter how you assign depths to the vertices of the cycle, there exists a subpath that is not monotone in the depth and thus is not considered by your algorithm. Sling Blade Runner is NP-hard, so no known polynomial-time strategy is going to produce optimal solutions on every input.

(Sling Blade Runner isn't quite the NP-hard longest path problem, which specifies paths with no repeated vertices instead of no repeated arcs, but there is an easy polynomial-time reduction from the latter to the former. Split each vertex v into v_in -> v_out, moving arc heads to the in vertex and arc tails to the out vertex. Add an arc from a source vertex s1 to a second source vertex s2 and an arc from s2 to each in vertex; likewise, add an arc from each out vertex to a sink vertex t1 and an arc from t1 to a second sink vertex t2.

To find the longest path on the graph a->b, b->c, c->a, c->d, the input to Sling Blade Runner would be

s1->s2,
s2->a_in, s2->b_in, s2->c_in, s2->d_in,
a_in->a_out, b_in->b_out, c_in->c_out, d_in->d_out,
a_out->b_in, b_out->c_in, c_out->a_in, c_out->d_in,
a_out->t1, b_out->t1, c_out->t1, d_out->t1,
t1->t2.

The longest path problem forbids repeated vertices, so the optimal solution is a->b->c->d rather than c->a->b->c->d. The corresponding chain in Sling Blade Runner is s1->s2->a_in->a_out->b_in->b_out->c_in->c_out->d_in->d_out->t1->t2. The corresponding transformation of the path with a repeated vertex would repeat the arc c_in->c_out and thus be infeasible for Sling Blade Runner.)
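
The vertex-splitting construction can be written down mechanically. The following Python sketch is one possible rendering of it; the function name `reduce_longest_path` and the tuple representation of arcs are assumptions made for the illustration, while the vertex names s1, s2, t1, t2 and the _in/_out suffixes follow the example above.

```python
from typing import List, Tuple

Arc = Tuple[str, str]


def reduce_longest_path(vertices: List[str], arcs: List[Arc]) -> List[Arc]:
    """Turn a longest-path instance into an arc list for Sling Blade Runner.

    Each vertex v becomes the arc v_in -> v_out, so using a vertex twice
    would mean using that arc twice, which Sling Blade Runner forbids.
    """
    out: List[Arc] = [("s1", "s2")]                           # source chain
    out += [("s2", f"{v}_in") for v in vertices]              # enter anywhere
    out += [(f"{v}_in", f"{v}_out") for v in vertices]        # split vertices
    out += [(f"{u}_out", f"{v}_in") for u, v in arcs]         # original arcs
    out += [(f"{v}_out", "t1") for v in vertices]             # leave anywhere
    out.append(("t1", "t2"))                                  # sink chain
    return out


reduced = reduce_longest_path(
    ["a", "b", "c", "d"],
    [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")],
)
```

For the four-vertex example this yields the same 18 arcs listed above, in the same order.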

Suppose that the movie titles are

S A
S B
A B
A E
B C
C D
D A
E F

, so that the graph looks like

    F
    ^
    |
    E
    ^
    |
S-->A-->B<--
|   ^   |   \
|   |   v   |
|   D<--C   |
\___________/

We start the DFS from S and get the following tree (because I said so; this is not the only possible DFS tree).

S-->A-->B-->C-->D
     \
      ->E-->F

. The depths are

S 0
A 1
B 2
C 3
D 4
E 2
F 3

, so the longest depth-monotone path is S A B C D. Longer chains exist, such as S B C D A E F, but they are not depth-monotone and so are never found. If you start the DFS elsewhere, then you won't even assign S a depth.
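
To make the gap concrete, here is a small Python sketch (not the asker's Scala code) that runs both strategies on the eight titles above. It assumes the reading used in this example, where each two-word title "X Y" is an arc X -> Y and only titles, not words, may not repeat. The names `deepest_tree_path` and `longest_chain` are hypothetical, and the exhaustive search is exponential in the worst case, so it is only suitable for toy inputs like this one.

```python
from typing import Dict, List

TITLES = ["S A", "S B", "A B", "A E", "B C", "C D", "D A", "E F"]


def build_arcs(titles: List[str]) -> Dict[str, List[str]]:
    """Treat each two-word title 'X Y' as an arc X -> Y."""
    graph: Dict[str, List[str]] = {}
    for title in titles:
        head, tail = title.split()
        graph.setdefault(head, []).append(tail)
        graph.setdefault(tail, [])
    return graph


def deepest_tree_path(graph: Dict[str, List[str]], root: str) -> List[str]:
    """Visited-set DFS; return the tree path from root to the deepest node."""
    depth = {root: 0}
    parent = {root: None}

    def visit(u: str) -> None:
        for v in graph[u]:
            if v not in depth:          # each node is visited only once
                depth[v] = depth[u] + 1
                parent[v] = u
                visit(v)

    visit(root)
    node = max(depth, key=depth.get)
    path = []
    while node is not None:             # backtrack through DFS parents
        path.append(node)
        node = parent[node]
    return path[::-1]


def longest_chain(graph: Dict[str, List[str]], root: str) -> List[str]:
    """Exhaustive backtracking: longest walk from root that reuses no arc."""
    best = [root]
    used = set()

    def extend(path: List[str]) -> None:
        nonlocal best
        if len(path) > len(best):
            best = path[:]
        u = path[-1]
        for v in graph[u]:
            if (u, v) not in used:
                used.add((u, v))
                path.append(v)
                extend(path)
                path.pop()
                used.remove((u, v))     # undo so other branches can use it

    extend([root])
    return best


g = build_arcs(TITLES)
tree_path = deepest_tree_path(g, "S")
best_chain = longest_chain(g, "S")
```

Here `tree_path` is S A B C D (four titles), while the backtracking search returns S A B C D A E F (seven titles), which passes through A twice but reuses no title. Neither that chain nor S B C D A E F is depth-monotone, so the depth-tagged DFS cannot produce them.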

A simpler example is

A B
B C
C D
D A

, where, no matter where you start, the optimal path, that goes all the way around the cycle, is not depth-monotone: A B C D A or B C D A B or C D A B C or D A B C D.

edited Oct 15 '14 at 03:23

answered Oct 06 '14 at 18:40

David Eisenstat


Thanks David, can you give an illustrative example in your answer? – Christopher Perry Oct 06 '14 at 22:35
But doesn't Christopher's algorithm start the DFS from each node in turn? So it will pick up the longest path starting at B anyway? – The Archetypal Paul Oct 07 '14 at 09:13
@Paul that's what I was thinking. I do a DFS on A, then a DFS on B, then on C etc until I've done a DFS on every movie in the list. – Christopher Perry Oct 07 '14 at 09:24
Right, but what path do you find to each of the other nodes? There are cycles, so you must stop those. However, you need to return all paths between your start node and each of the other (reachable) nodes, not just one of the paths. Actually you need to return only the longest, but that requires calculating all paths, and that's what leads to the NP-completeness. In other words, a DFS finds a node, and normally the path to it is not relevant, but you also care about how it got there (in fact, all possible ways it could have got there). – The Archetypal Paul Oct 07 '14 at 10:30
@ChristopherPerry I added another arc to force the DFS root that I want, breaking the algorithm as written. – David Eisenstat Oct 07 '14 at 13:35
@DavidEisenstat I see now. The image really helped me see what is going on. I'll adjust my DFS and write a test based on your example. – Christopher Perry Oct 07 '14 at 21:16
@DavidEisenstat Can you please expand on what you said about splitting the vertices? I'm not seeing what you mean, and having a hard time finding anything online. – Christopher Perry Oct 15 '14 at 03:14
@ChristopherPerry I added some details. – David Eisenstat Oct 15 '14 at 03:23
@DavidEisenstat I appreciate the effort, but I must be dense. I'm still not understanding. Where are `s` and `t` coming from in your example? Can you expand your example more? I'm not at the PhD level. – Christopher Perry Oct 15 '14 at 04:23
@ChristopherPerry Sorry, those are the source and sink vertices respectively. – David Eisenstat Oct 15 '14 at 04:36
@DavidEisenstat You said `c_in->c_out` would be repeated, so transforming the vertices doesn't help solve Sling Blade Runner, correct? – Christopher Perry Oct 15 '14 at 04:39
@ChristopherPerry We're trying to show that Sling Blade Runner is NP-hard. We do this using a counterfactual: if Sling Blade Runner were easy, then we could use the algorithm to solve longest path, so longest path would be easy too. We know that longest path is hard, though, so Sling Blade Runner must be hard as well. The comment that you quoted is the barest sketch of a proof that the reduction (machinery adapting the Sling Blade Runner algorithm to longest path) is correct. I'm hinting at the reason that the gadgetry that I've set up disallows non-simple paths after interpretation. – David Eisenstat Oct 15 '14 at 04:45