Here is one possible approach, in which I treat the problem directly as an optimisation problem.
Suppose you have a feasible partition of the rows into groups: not necessarily a good one, but one that does not violate the constraints. For every group (cluster), you compute the centre. Then you compute the distances of all points in a group to the group's centre, and sum these distances over all groups. This gives you a measure of the quality of your initial partition.
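(In symbols: if $G_1, \dots, G_k$ are the groups and $c_g$ is the centre of group $g$, this quality measure is $\sum_{g=1}^{k} \sum_{i \in G_g} \lVert x_i - c_g \rVert$, i.e. the total distance of all points to their group centres, which we want to minimise.)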
Now randomly pick one row and move it into another group. You get a new solution. Compute its quality as before, and compare it with the previous solution's quality. If it is better, keep it; if it is worse, stay with the old solution. Repeat this whole procedure for a fixed number of iterations.
This process is called a Local Search. Of course, it is not guaranteed to take you to an optimal solution, but it will likely give you a good one. (Note that k-means implementations are typically stochastic as well, and there is no guarantee of an "optimal" partition.)
The good thing about a Local Search is that it gives you a lot of flexibility. For instance, I assumed you started with a feasible solution. Suppose you make a random move (i.e. move one row into another cluster), but now this new cluster is too big. You could simply discard this new, infeasible solution and draw a new one.
Here is a code example; it is really just an outline, but I hope it is useful for you.
set.seed(123)
id <- 1:600
lon <- rnorm(600, 88.5, 0.125)
lat <- rnorm(600, 22.4, 0.15)
demand <- round(rnorm(600, 40, 20))
df <- data.frame(id, lon, lat, demand)
Fix a number of clusters, k.
k <- 5
Start with kmeans and plot the solution.
par(mfrow = c(1, 3))
km <- kmeans(cbind(df$lat, df$lon), centers = k)
cols <- hcl.colors(n = k, "Cold")
plot(df$lon,
     df$lat,
     type = "p", pch = 19, cex = 0.5,
     main = "kmeans")
for (i in seq_len(k)) {
    points(df$lon[km$cluster == i],
           df$lat[km$cluster == i],
           pch = 19,
           col = cols[i])
}
Now a Local Search. I use an implementation in package NMOF
(which I maintain).
library("NMOF")
## a random initial solution
x0 <- sample(1:k, length(id), replace = TRUE)
X <- as.matrix(df[, 2:3])
The objective function: it takes a partition x and computes the sum of distances over all clusters.
sum_diff <- function(x, X, k, ...) {
    groups <- seq_len(k)
    d_centre <- numeric(k)
    for (g in groups) {
        ## centre of group g ('drop = FALSE' keeps the
        ## matrix structure for single-member groups)
        centre <- colMeans(X[x == g, , drop = FALSE])
        ## distances of the group's points to its centre
        d <- t(X[x == g, , drop = FALSE]) - centre
        d_centre[g] <- sum(sqrt(colSums(d * d)))
    }
    sum(d_centre)
}
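As a quick check, you can evaluate the random initial solution directly:

sum_diff(x0, X, k)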
The neighbourhood function: it takes a partition and moves
one row into another cluster.
nb <- function(x, k, ...) {
    groups <- seq_len(k)
    x_new <- x
    ## pick one row at random ...
    p <- sample.int(length(x), 1)
    ## ... and move it into another, randomly-chosen group
    g_ <- groups[-x_new[p]]
    x_new[p] <- g_[sample.int(length(g_), 1)]
    x_new
}
Run the Local Search. I actually use a method called Threshold Accepting, which is based on Local Search but can also move away from local minima. See ?NMOF::TAopt for references on the method.
sol <- TAopt(sum_diff,
             list(x0 = x0,
                  nI = 20000,
                  neighbour = nb),
             X = X,
             k = k)
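TAopt returns a list whose components include the best solution found (xbest) and its objective-function value (OFvalue), so you can compare with the random start:

sum_diff(x0, X, k)  ## quality of the random start
sol$OFvalue         ## quality after the search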
Plot the solution.
plot(df$lon,
     df$lat,
     type = "p", pch = 19, cex = 0.5,
     main = "Local search")
for (i in seq_len(k)) {
    points(df$lon[sol$xbest == i],
           df$lat[sol$xbest == i],
           pch = 19,
           col = cols[i])
}
Now, here is one way to add a capacity constraint. We start with a feasible solution (with the seed used above, the random initial solution x0 already is one):
## CAPACITY-CONSTRAINED
max.demand <- 6600
all(tapply(df$demand, x0, sum) < max.demand)
## TRUE
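If the random start were not feasible, one simple way to get a feasible one, just as a sketch, would be to resample until the check passes:

## resample until the initial solution is feasible
while (!all(tapply(df$demand, x0, sum) < max.demand))
    x0 <- sample(1:k, length(id), replace = TRUE)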
The constraint is handled in the neighbourhood. If the new solution exceeds the capacity, it is discarded.
nb_constr <- function(x, k, demand, max.demand, ...) {
    groups <- seq_len(k)
    x_new <- x
    p <- sample.int(length(x), 1)
    g_ <- groups[-x_new[p]]
    x_new[p] <- g_[sample.int(length(g_), 1)]
    ## if the capacity of the receiving cluster is
    ## exceeded, return the original solution
    if (sum(demand[x_new == x_new[p]]) > max.demand)
        x
    else
        x_new
}
Run the method and compare the results.
sol <- TAopt(sum_diff,
             list(x0 = x0,
                  nI = 20000,
                  neighbour = nb_constr),
             X = X,
             k = k,
             demand = df$demand,
             max.demand = max.demand)
plot(df$lon,
     df$lat,
     type = "p", pch = 19, cex = 0.5,
     main = "Local search w/ constraint")
for (i in seq_len(k)) {
    points(df$lon[sol$xbest == i],
           df$lat[sol$xbest == i],
           pch = 19,
           col = cols[i])
}
all(tapply(df$demand, sol$xbest, sum) < max.demand)
## TRUE
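You can also inspect the per-cluster demands directly:

tapply(df$demand, sol$xbest, sum)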

This is really just an example and could be improved in many ways. For instance, the objective function recomputes the distances for all groups after every move, when it would only need to recompute the two groups affected by the move.
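As a sketch of such a delta evaluation (not used above; update_diff, the cached vector d_centre and the argument affected are hypothetical names): keep the per-cluster sums, and after a move recompute only the changed groups.

## sketch of a cheaper, incremental evaluation:
## 'd_centre' caches the per-cluster sums; 'affected'
## holds the labels of the one or two changed groups
update_diff <- function(d_centre, x, X, affected) {
    for (g in affected) {
        Xg <- X[x == g, , drop = FALSE]
        centre <- colMeans(Xg)
        d <- t(Xg) - centre
        d_centre[g] <- sum(sqrt(colSums(d * d)))
    }
    d_centre  ## the objective is sum(d_centre)
}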