
I want to partition a vector (length around 10^5) into five classes. I wanted to use the function classIntervals from package classInt with style = "jenks" (natural breaks), but this takes an inordinate amount of time even for a much smaller vector of only 500 values. Setting style = "kmeans" executes almost instantaneously.

library(classInt)

my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)

system.time(classIntervals(x, n = 5, style = "jenks"))
   user  system elapsed 
  13.46    0.00   13.45 

system.time(classIntervals(x, n = 5, style = "kmeans"))
   user  system elapsed 
   0.02    0.00    0.02

What makes the Jenks algorithm so slow, and is there a faster way to run it?

If need be I will move the last two parts of the question to stats.stackexchange.com:

  • Under what circumstances is kmeans a reasonable substitute for Jenks?
  • Is it reasonable to define classes by running classIntervals on a random 1% subset of the data points?
J. Win.
  • Do read the help for functions. `kmeans` uses a random set of samples as initial cluster centres. To get reproducible results, set a seed via `set.seed()` and read up about k-means and local vs global minima. This is mentioned in `?classIntervals`. – Gavin Simpson Mar 14 '11 at 21:00
  • Thanks Gavin. I found that part soon after posting and edited the question. – J. Win. Mar 14 '11 at 21:13
  • I don't think there's much evidence to suggest that Jenks breaks are any better than quantiles. – hadley Mar 15 '11 at 01:46
  • @hadley: That's hard to believe. Imagine your data represents the heights of 10 adults and 90 children. It should be clear that a good clustering algorithm tells you more than stuffing them into equal-sized quantiles. – J. Win. Mar 15 '11 at 02:35
  • Ok, but it's pretty unusual to get data with very clear clusters like that. Do you really think that your 100,000 points nicely cluster into only 5 clusters?! If so, I wish I could work with data like yours. – hadley Mar 15 '11 at 02:53
  • I don't know about 5 clusters, but there are definitely situations where you expect two clusters. Converting a grayscale image of part of a page into a black and white only image for optical character recognition is an excellent example for which quantiles will be very wrong, but two definite clusters are expected. –  Aug 17 '12 at 15:40

2 Answers


To answer your original question:

What makes the Jenks algorithm so slow, and is there a faster way to run it?

Yes, there is now a faster way to apply the Jenks algorithm: the getJenksBreaks function in the BAMMtools package.

However, be aware that you have to set the number of breaks differently: if you set n to 5 in the classIntervals function of the classInt package, you have to set the number of breaks to 6 in the getJenksBreaks function of the BAMMtools package to get the same results, because five classes are delimited by six break points.

# Install and load library
install.packages("BAMMtools")
library(BAMMtools)

# Set up example data
my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)

# Apply function
getJenksBreaks(x, 6)
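
Since the function returns break points rather than class labels, one more step is needed to assign each value to a class. The following is a minimal base-R sketch of that step. The vectors `brks` and `vals` are hypothetical stand-ins (made-up numbers, not the real output of getJenksBreaks(x, 6)), chosen only to show that six break points delimit five classes.

```r
# Hypothetical break points standing in for the output of getJenksBreaks(x, 6).
# Six break points delimit five classes.
brks <- c(0, 7.5, 12.5, 17.5, 22.5, 30)

# Hypothetical values to classify (illustration only).
vals <- c(0, 3, 7.5, 10, 18, 29, 30)

# findInterval() returns, for each value, the index of the interval it falls in.
# Intervals are closed on the left; all.inside = TRUE keeps the maximum value
# in the last class instead of opening a sixth one.
cls <- findInterval(vals, brks, all.inside = TRUE)

n_classes <- length(brks) - 1  # 5
```

With real data, `brks` would be replaced by the vector returned by getJenksBreaks and `vals` by `x`.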

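The speed-up is huge: in the benchmark below, the median run time drops from about 2.1 s to about 0.003 s, i.e. by a factor of several hundred.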

> microbenchmark( getJenksBreaks(x, 6, subset = NULL),  classIntervals(x, n = 5, style = "jenks"), unit="s", times=10)
Unit: seconds
                                      expr         min          lq        mean      median          uq         max neval cld
       getJenksBreaks(x, 6, subset = NULL) 0.002824861 0.003038748 0.003270575 0.003145692 0.003464058 0.004263771    10  a 
 classIntervals(x, n = 5, style = "jenks") 2.008109622 2.033353970 2.094278189 2.103680325 2.111840853 2.231148846    10   
majom

From ?BAMMtools::getJenksBreaks:

The Jenks natural breaks method was ported to C from code found in the classInt R package.

The two functions implement the same algorithm; getJenksBreaks is faster only because of the implementation (compiled C vs interpreted R).
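
To illustrate why the implementation language matters so much here: one common way to compute optimal natural breaks is a dynamic program over the sorted values that minimises the within-class sum of squared deviations. It uses three nested loops, roughly k·n² steps, which is expensive in interpreted R and cheap in compiled C. The sketch below is only an illustration under that assumption. It is not the code from classInt or BAMMtools, and the name `jenks_sketch` and the convention of reporting each class's lower bound are hypothetical choices.

```r
# Illustrative dynamic program for natural breaks (not the package code).
jenks_sketch <- function(x, k) {
  stopifnot(k >= 2, k <= length(x))
  x <- sort(x)
  n <- length(x)
  # Prefix sums give the sum of squared deviations of x[i..j] in O(1).
  s1 <- c(0, cumsum(x))
  s2 <- c(0, cumsum(x^2))
  ssq <- function(i, j) {
    (s2[j + 1] - s2[i]) - (s1[j + 1] - s1[i])^2 / (j - i + 1)
  }
  cost  <- matrix(Inf, n, k)  # cost[j, cl]: best cost of x[1..j] in cl classes
  start <- matrix(1L, n, k)   # start[j, cl]: index where the last class begins
  for (j in 1:n) cost[j, 1] <- ssq(1, j)
  for (cl in 2:k) {           # three nested loops: about k * n^2 / 2 steps
    for (j in cl:n) {
      for (i in cl:j) {
        v <- cost[i - 1, cl - 1] + ssq(i, j)
        if (v < cost[j, cl]) {
          cost[j, cl]  <- v
          start[j, cl] <- i
        }
      }
    }
  }
  # Backtrack from the last class to recover the k + 1 break points.
  brks <- numeric(k + 1)
  brks[1] <- x[1]
  brks[k + 1] <- x[n]
  j <- n
  for (cl in k:2) {
    i <- start[j, cl]
    brks[cl] <- x[i]          # lower bound of class cl
    j <- i - 1
  }
  brks
}

jenks_sketch(c(1, 2, 3, 10, 11, 12, 20, 21, 22), 3)
# 1 10 20 22
```

For the 500 values in the question with k = 5, the inner loop body runs several hundred thousand times, which is the kind of workload where compiled code has a large advantage over an interpreted loop.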

Drumy