Traveling Salesman Variation: Building a tour of baseball stadiums

Question

I'm trying to write a Java program to find the best itinerary for a driving tour that takes in a game at each of the 30 Major League Baseball stadiums. I define the "best" itinerary using the metric Miles Driven + (200 * Number of Days on the Road); this eliminates tours that are 20,000 miles over 30 days, or 11,000 miles over 90 days, neither of which would be a trip that I'd want to take. Each team plays 81 home games over the course of a 183-day season, so the dates on which each team is at home need to be taken into consideration.
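To make the metric concrete, here is a minimal sketch of it as a scoring function. The class and method names (`TourScore`, `score`) are made up for illustration and are not taken from the actual program.

```java
/** Hypothetical helper that scores an itinerary with the metric described above. */
final class TourScore {
    // From the metric: every day on the road counts the same as 200 miles driven.
    static final int DAY_PENALTY_MILES = 200;

    /** Lower is better: miles driven plus 200 for each day on the road. */
    static long score(long milesDriven, int daysOnRoad) {
        return milesDriven + (long) DAY_PENALTY_MILES * daysOnRoad;
    }
}
```

Under this sketch, the 20,000-mile, 30-day tour scores 26,000 and the 11,000-mile, 90-day tour scores 29,000, so both lose to any tour that keeps miles and days moderate at the same time.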

Also, I'm not just looking for one best tour for the entire baseball season. I'm looking to find the best tour that starts/ends at any given MLB city, on any given date (Detroit on June 15, Atlanta on August 3, etc.).

I've got the program producing results that I'm pretty happy with, but it would take a few months to run to completion on my laptop, and I'm wondering if anyone has any ideas for how to make it more efficient.

The program runs iteratively. It starts with a single game; say, Chicago on April 5. It figures out which games you could get to next within the next day or two after the Chicago game; let's say there are two such games, in Cincinnati and Detroit. It creates a data structure containing all the stops on each prospective tour (one for Chicago-Cincinnati, one for Chicago-Detroit). Then, it does the same thing to find prospective 3rd stops for both of the 2-stop tours, and so on, until it gets to the 30th and last stop, at which point it ascertains the best tour.

It uses a couple of methods to prune inefficient tours as it goes. The main one is employed using a HashMap. The key is a character sequence that denotes (1) which ballparks have already been visited, and (2) which was the last one visited. So it would run into a duplicate on, say, A-B-C-D-E and A-D-B-C-E. It would then keep the shorter route and eliminate the longer one. For most starting points, this keeps the maximum number of tours on the list at any given time at around 20 million, but for some starting points, it gets up to around 90 million.
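The HashMap pruning described above could be sketched roughly as follows. This is not the actual code: it assumes stadiums are numbered 0 to 29 so that the visited set fits in an `int` bitmask instead of a character sequence, and the names `TourPruner`, `PartialTour`, and `offer` are hypothetical. Whether `cost` holds miles alone or the full miles-plus-days metric is also an assumption left to the caller.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: keep only the cheapest partial tour per (visited set, last stadium). */
final class TourPruner {
    /** A partial tour: bit i of visitedMask is set if stadium i has been visited. */
    static final class PartialTour {
        final int visitedMask;
        final int lastStadium;
        final long cost;

        PartialTour(int visitedMask, int lastStadium, long cost) {
            this.visitedMask = visitedMask;
            this.lastStadium = lastStadium;
            this.cost = cost;
        }
    }

    private final Map<Long, PartialTour> best = new HashMap<>();

    /** Packs the visited set and the last stadium (0..29, so 5 bits) into one key. */
    static long key(int visitedMask, int lastStadium) {
        return ((long) visitedMask << 5) | lastStadium;
    }

    /** Returns true if the candidate was kept, false if an equal or cheaper tour already holds the key. */
    boolean offer(PartialTour candidate) {
        long k = key(candidate.visitedMask, candidate.lastStadium);
        PartialTour existing = best.get(k);
        if (existing != null && existing.cost <= candidate.cost) {
            return false; // dominated: same stadiums, same endpoint, no cheaper
        }
        best.put(k, candidate);
        return true;
    }

    int size() {
        return best.size();
    }
}
```

With this layout, A-B-C-D-E and A-D-B-C-E produce the same key (same five bits set, both ending at E), so only the cheaper of the two survives. A packed `long` key may also use less memory than a character sequence when tens of millions of tours are live, though that has not been measured here.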

So ... any ideas?

Also, if you're wondering what in the world I'd need something like this for, it's for a searchable-database website that I'm doing as a hobby (http://www.bestballparktour.com, if you're interested). The data set that I have loaded up now, I'm not happy with; I took too many shortcuts and there are too many cases where you can find better tours if you consult the MLB schedule. Also, I know that the website is not very attractive looking; I'm not a real web programmer. I'll probably try to make it look snazzier after I get the database fixed. — matt1414, Jun 01 '15 at 11:21
Are you familiar with the A* (A star) algorithm? I'm honestly not 100% familiar with the domain of your problem, but it seems to me that you should be able to define an estimate metric for A* that would greatly increase your efficiency compared with a naive breadth-first search such as the one you are doing. — Diego Martinoia, Jun 01 '15 at 11:29
I'd say don't process a given tour if the next hop is too long in days, in miles, or in miles per day. You've mentioned 20k miles over 30 days: that is about 667 miles per day on average, or roughly 83 mph if you drive for 8 hours a day. Isn't that too fast? So, set thresholds for "too far", "too late", and "too speedy", and drop a tour if any one hop exceeds any of them. (I'd say these thresholds should be optional, as some people would accept a single long hop from the East Coast to the West Coast if all the other hops are decently short.) — Vesper, Jun 01 '15 at 11:42
@DiegoMartinoia The problem with an A* implementation here is that A* usually benefits from a limited search space, while here a complete graph between all the stadiums is assumed, and the distances between nodes also vary with time. Say two neighboring stadiums A and B host games today and tomorrow respectively, then A hosts a game a week afterwards, and B hosts a game two weeks afterwards. The A-B distance is then 1 day at first and 2 weeks later on, while the B-A distance is one week at first and unknown later on. So I think A* cannot be implemented here at all. — Vesper, Jun 01 '15 at 11:46
@Vesper, I thought that we could represent the nodes in the graph as location+time pairs (i.e. one node == one game), so that all the distances are well-defined (going back in time == infinite distance). But, again, I probably didn't completely understand the problem. — Diego Martinoia, Jun 01 '15 at 11:54
@DiegoMartinoia The problem is to build a route (one-way, apparently) that goes through all the stadiums and takes in one game at each, so the nodes for A* should be stadiums, not stadium+time pairs. Sadly, for 30 stadiums the complete solution set has 29! (29 factorial) orderings, since the starting node is fixed, which is too big for the OP's laptop to crunch without effective culling. He's using a greedy approach, i.e. grab the games you can reach right here and right now, two at most, which gives a search space of size 2^29, with the majority of the culling available only in the last stages of the search. — Vesper, Jun 01 '15 at 12:01
@Vesper I am using a greedy approach, but I'm not topping it out at two. There can be 10 or more, it all depends on what games are scheduled in which locations on which dates. I'm also doing culling throughout. I have too far/too late/too speedy limitations, and I also have rules like "Do all 7 of the West Coast sites at once," so if a tour goes out to California and only does three ballparks and then goes back east, it gets dropped. Also, the two Florida ballparks need to be done back-to-back, and any tour gets dropped once it hits 18,000 miles. — matt1414, Jun 01 '15 at 12:35

score 0 · Answer 1 · answered Jun 02 '15 at 03:18

Your algorithm is not actually greedy -- it enumerates all possible tours (pruning the obviously bad ones as you go). A greedy algorithm looks only one step ahead, makes the best decision for that step, then moves on to the next step, and so on. Greedy algorithms are not exact, but they are very fast. I would suggest adapting a standard greedy-type TSP heuristic to your problem. There are many common ones -- nearest neighbor, cheapest insertion, etc. You'll find various online sources describing them if you aren't already familiar with them.
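As a rough illustration only, a nearest-neighbor-style heuristic adapted to a schedule might look like the sketch below. Everything in it is an assumption for the sake of the example: the `Game` type, the `miles` distance matrix, the `maxMilesPerDay` feasibility rule, and the choice to treat "nearest" as the lowest incremental value of your metric (miles plus 200 per day elapsed).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a nearest-neighbor-style greedy tour builder. */
final class GreedyTour {
    /** One home game: stadium index and day of the season. */
    static final class Game {
        final int stadium;
        final int day;

        Game(int stadium, int day) {
            this.stadium = stadium;
            this.day = day;
        }
    }

    /**
     * Starting from one game, repeatedly picks the reachable game at an unvisited
     * stadium with the lowest incremental cost. Returns the games in visiting order,
     * or an empty list if the greedy choices lead to a dead end.
     */
    static List<Game> build(Game start, List<Game> schedule, int[][] miles,
                            int stadiumCount, int maxMilesPerDay) {
        List<Game> tour = new ArrayList<>();
        boolean[] visited = new boolean[stadiumCount];
        Game current = start;
        tour.add(current);
        visited[current.stadium] = true;

        while (tour.size() < stadiumCount) {
            Game bestNext = null;
            long bestCost = Long.MAX_VALUE;
            for (Game g : schedule) {
                if (visited[g.stadium] || g.day <= current.day) continue;
                int days = g.day - current.day;
                int dist = miles[current.stadium][g.stadium];
                if (dist > (long) days * maxMilesPerDay) continue; // cannot drive there in time
                long cost = dist + 200L * days; // incremental miles + 200 per day
                if (cost < bestCost) {
                    bestCost = cost;
                    bestNext = g;
                }
            }
            if (bestNext == null) return new ArrayList<>(); // dead end, no full tour
            tour.add(bestNext);
            visited[bestNext.stadium] = true;
            current = bestNext;
        }
        return tour;
    }
}
```

Each step scans the schedule once, so a full tour costs on the order of (stadiums × games) operations per starting game. The result is usually not optimal, and it can dead-end where an exhaustive search would not, but it gives a fast upper bound that could also be used to prune an exact search.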

You could also create duplicate stadium nodes, one for each home game, and then model this as a generalized TSP (GTSP) in which each "cluster" consists of the nodes for a given stadium. The distance from node (i_1,j_1) to (i_2,j_2) (where i = stadium and j = date) is defined by your metric.
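A minimal sketch of that node and cluster structure is below, under some loudly stated assumptions: games arrive as `{stadium, day}` pairs, a hop is feasible only if it moves forward in time and fits within a hypothetical `maxMilesPerDay` limit, and infeasible hops get a large sentinel cost. The names `GtspModel`, `clusters`, and `edgeCost` are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the GTSP view: one node per home game, one cluster per stadium. */
final class GtspModel {
    // Large sentinel for infeasible hops; small enough that a few can be added without overflow.
    static final long UNREACHABLE = Long.MAX_VALUE / 4;

    /** Groups game days by stadium. Each row of games is {stadium, day}. */
    static Map<Integer, List<Integer>> clusters(int[][] games) {
        Map<Integer, List<Integer>> byStadium = new TreeMap<>();
        for (int[] g : games) {
            byStadium.computeIfAbsent(g[0], k -> new ArrayList<>()).add(g[1]);
        }
        return byStadium;
    }

    /** Cost of going from node (fromStadium, fromDay) to node (toStadium, toDay). */
    static long edgeCost(int fromStadium, int fromDay, int toStadium, int toDay,
                         int[][] miles, int maxMilesPerDay) {
        int days = toDay - fromDay;
        if (days <= 0) return UNREACHABLE; // no going back in time (same-day hops excluded too)
        int dist = miles[fromStadium][toStadium];
        if (dist > (long) days * maxMilesPerDay) return UNREACHABLE; // cannot drive it in time
        return dist + 200L * days; // the question's metric applied to a single hop
    }
}
```

A GTSP tour then picks exactly one node from each cluster. Note that the edge costs are asymmetric (the reverse of a feasible hop always goes back in time), so any GTSP solver or heuristic used with this model would need to support asymmetric instances.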

Technically this is a TSP with time windows, but it's more complicated than the usual definition of the TSPTW because usually each node has a single contiguous time window (e.g., arrive between 8 am and 2 pm) whereas here you have a disjoint set of windows and you must choose one of them.

Hope this helps.
