K-Means Clustering a list of US addresses based on drive time

Question

I have 8 traveling consultants who need to visit 155 groups across the continental United States. Is there a way to find the optimal 8 regions based on drive time using k-means clustering? I see there are some methods already implemented for other data sets, but they are not based on drive time. How will I need to manipulate my data set to make it usable?

Thank you in advance for any feedback. I am by no means a great coder, I have taken only a few introductory courses back in college.

This website is a *programming* website. You really should try to *code* something to ask a good question here. It's not the appropriate site for stats questions, in particular not ones that have been answered before (k-means minimizes *variance*, not driving time). — Has QUIT--Anony-Mousse, Jun 15 '15 at 23:11
This is a programming question. I need to code something that can make decisions about what regions our groups should be in and what regions new groups will be assigned to. There will be other dimensions in the 'similarity' vector, and k-means clustering is the machine-learning method that seems to best make these decisions. — elied0327, Jun 16 '15 at 15:09
Well, then start coding, instead of asking, if you already know the answer... — Has QUIT--Anony-Mousse, Jun 16 '15 at 17:34

score 0 · Accepted Answer · answered Jun 15 '15 at 23:52

I think you are looking for "path planning" rather than clustering. The traveling salesman problem comes to mind.

If you want to use clustering to find the individual regions, you should find the coordinates of each location with respect to some global frame, for example latitude and longitude. Create a 155x2 array X in which each row is a destination and the columns are latitude and longitude. Then run MATLAB's kmeans; something like

[idx,C] = kmeans(X,8);

should work nicely. This should be enough to get you started.
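For readers without MATLAB, the same idea can be sketched in plain Python. This is only a minimal version of Lloyd's iteration (the standard k-means heuristic), not what MATLAB's kmeans does internally. The six (latitude, longitude) rows, the `seed` argument, and the helper names `nearest` and `kmeans` are all made up for illustration. It treats latitude and longitude as flat x/y coordinates, exactly as the MATLAB call above would.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` by squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda j: (point[0] - centroids[j][0]) ** 2
        + (point[1] - centroids[j][1]) ** 2,
    )

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input rows
    labels = None
    for _ in range(iters):
        # Assignment step: each site goes to its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids

# Hypothetical sites: three near the East Coast, three near the West Coast.
sites = [(40.7, -74.0), (40.0, -75.2), (42.4, -71.1),
         (34.1, -118.2), (37.8, -122.4), (32.7, -117.2)]
labels, centroids = kmeans(sites, 2)

# A new site is placed in the region whose centroid is closest.
new_site_region = nearest((38.6, -121.5), centroids)
```

For the real problem, `sites` would hold the 155 geocoded addresses and `k` would be 8. The `nearest` helper also covers the "which region does a new group belong to" case.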

One issue with this approach is that it groups the sites by geographical location, which isn't always the same as shortest travel time. For instance,

distance from (site A, site B) = 0.5 miles
distance from (site A, site C) = 2.0 miles

But suppose getting from A to B requires going around a river, so the actual travel distance is 10 miles, whereas A to C is realistically 2.5 miles. Clearly A-C is the better choice, but using global position alone wouldn't take this into account.
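One possible way around this, sketched here in Python, is to cluster on a precomputed pairwise cost matrix instead of on coordinates. The sketch uses k-medoids (a different algorithm from k-means, chosen because it needs only pairwise costs, not coordinates). The A-B and A-C figures are the ones from the example above. The B-C figures, the function name `k_medoids`, and the caller-supplied starting medoids are assumptions made purely for illustration. A real drive-time matrix would have to come from some routing service.

```python
def k_medoids(dist, init_medoids, iters=100):
    """Alternating k-medoids on a square cost matrix; returns (labels, medoids)."""
    n = len(dist)
    medoids = list(init_medoids)
    k = len(medoids)
    labels = []
    for _ in range(iters):
        # Assign every site to the medoid that is cheapest to reach.
        labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
        new_medoids = []
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j]
            # New medoid: the member with the lowest total cost to its cluster.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids

# Site order: A, B, C. The B-C entries are hypothetical.
straight_line = [[0.0, 0.5, 2.0],
                 [0.5, 0.0, 2.2],
                 [2.0, 2.2, 0.0]]
road = [[0.0, 10.0, 2.5],
        [10.0, 0.0, 11.0],
        [2.5, 11.0, 0.0]]

straight_labels, _ = k_medoids(straight_line, init_medoids=[1, 2])
road_labels, _ = k_medoids(road, init_medoids=[1, 2])
```

With straight-line distances A ends up with B. With road distances A ends up with C. Like k-means, this heuristic depends on its starting medoids, so in practice it would be run from several starting points.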

answered Jun 15 '15 at 23:52

andrew


The traveling salesman problem certainly comes into play once the regions have been assigned to each consultant. My thought was to cluster the groups; it just so happens that the vector is one dimensional, with that one dimension being an actual physical location (I'm sure there are other dimensions that I can add to this 'similarity' vector, but I haven't done so yet). It will be interesting to code something that can decide which region new groups will be in, and then periodically re-order the regions. Again, the relevant dimensions will need to be determined, and drive time incorporated. – elied0327 Jun 16 '15 at 15:05
k-means doesn't work well on geographical data, because *squared Euclidean distance* on (latitude, longitude) can be quite different from *actual* geographic distance... – Has QUIT--Anony-Mousse Jun 16 '15 at 17:35
@Anony-Mousse, you are 100% correct. As I stated in my answer, this is certainly a downfall of my approach. @elied0327, if you were to incorporate the distance into clustering (just as with traveling salesman), I believe the clustering would become NP-hard, because you are trying to find the shortest distance (NP-hard) and then also group the sites on top of that. Looking into something like "path planning for cooperative multi-robot networks" might provide optimized solutions, but I also think they would be highly complex, and the cost of implementing them may outweigh the benefits. – andrew Jun 16 '15 at 17:44
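The point raised in these comments about Euclidean distance on (latitude, longitude) can be checked with a short Python sketch. It uses the haversine great-circle formula with a mean Earth radius of about 3958.8 miles. The two pairs of coordinates are arbitrary illustrative points, not sites from the question. Both pairs are exactly one degree of longitude apart, so k-means on raw degrees would treat them as equally far apart.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2, radius=3958.8):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

# Two pairs, each one degree of longitude apart, at different latitudes.
south = haversine_miles(26.0, -81.0, 26.0, -80.0)    # a southern latitude
north = haversine_miles(48.0, -101.0, 48.0, -100.0)  # a northern latitude

print(f"1 degree of longitude at 26N: {south:.1f} miles")
print(f"1 degree of longitude at 48N: {north:.1f} miles")
```

The southern pair is noticeably farther apart in miles than the northern pair, even though both differ by the same amount in degrees. Great-circle distance is still not drive time, so this only addresses the geometric part of the objection.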

score 0 · Answer 2 · answered Jun 16 '15 at 09:34

This looks more like an integer optimization problem. It has nothing to do with clustering.

Reminds me of the case study "Assigning Regions to Sales Representatives [SRs] at Pfizer Turkey" by Murat Köksalan and Sakine Batun, INFORMS Transactions on Education 9(2), p.70-71, January 2009. http://pubsonline.informs.org/doi/abs/10.1287/ited.1090.0021ca .

I had to solve a simplified version of the problem in a MOOC recently.

"Since the SRs have to visit the MDs in their offices, it is important to minimize the total distance traveled by the SRs. This is the objective function. Each SR has an office in a certain brick, called their "center brick". We will compute the total distance traveled by an SR as the sum of the distances between the center brick and every other brick in that SR's territory."
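The objective in that quote can be written down directly. The Python sketch below is only an illustration of the quoted formula, not the case study's model. The 4x4 distance matrix, the choice of center bricks, and the function names are hypothetical, and the greedy "nearest center" assignment ignores any constraints a real territory-design model would add (for example, balancing workload between representatives).

```python
def assign_to_nearest_center(bricks, centers, dist):
    """Greedy assignment: each brick joins the territory of its closest center brick."""
    territories = {c: [] for c in centers}
    for b in bricks:
        closest = min(centers, key=lambda c: dist[c][b])
        territories[closest].append(b)
    return territories

def total_distance(territories, dist):
    """Sum of distances from each center brick to every brick in its territory."""
    return sum(
        dist[center][brick]
        for center, bricks in territories.items()
        for brick in bricks
    )

# Hypothetical symmetric distances between four bricks (indices 0..3).
dist = [[0, 4, 9, 7],
        [4, 0, 6, 8],
        [9, 6, 0, 3],
        [7, 8, 3, 0]]
centers = [0, 3]  # assumed center bricks of two representatives

territories = assign_to_nearest_center(range(4), centers, dist)
objective = total_distance(territories, dist)
```

An integer-programming formulation would minimize the same `total_distance` over all allowed assignments instead of taking the greedy one.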

You can optimize this for certain criteria. I cannot give any more details here because it is quite complicated.

answered Jun 16 '15 at 09:34

knb


Hey there @knb, I followed your pubsonline link and it piqued my curiosity since I have exactly this problem and that paper seems to talk about the solution (but doesn't actually show a solution). Would you happen to have the "teacher materials" it refers to? Or perhaps you could link me to the MOOC you were in?.. I'm trying to establish service zones for a business that visits those zones on different days... but I also need a way to figure out which zone an unseen / new address should belong to... say, when a new customer signs up. – zelusp Nov 16 '18 at 06:31
@zelusp Sorry, I don't remember much about this problem. The MOOC is called "[The Analytics Edge](https://www.edx.org/course/the-analytics-edge)" and as far as I know it is still offered by MIT as a self-paced course. Have a look. You must be enrolled to see Assignment 9 (meanwhile, they might have changed the case studies). I still have the course materials on a hard disk, but not on the computer I'm sitting at right now. – knb Nov 16 '18 at 08:58
