I want to partition a vector (length around 10^5) into five classes. With the function classIntervals
from package classInt
I wanted to use style = "jenks"
natural breaks but this takes an inordinate amount of time even for a much smaller vector of only 500. Setting style = "kmeans"
executes almost instantaneously.
library(classInt)
my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)
system.time(classIntervals(x, n = 5, style = "jenks"))
R> system.time(classIntervals(x, n = 5, style = "jenks"))
user system elapsed
13.46 0.00 13.45
system.time(classIntervals(x, n = 5, style = "kmeans"))
R> system.time(classIntervals(x, n = 5, style = "kmeans"))
user system elapsed
0.02 0.00 0.02
What makes the Jenks algorithm so slow, and is there a faster way to run it?
If need be I will move the last two parts of the question to stats.stackexchange.com:
- Under what circumstances is kmeans a reasonable substitute for Jenks?
- Is it reasonable to define classes by running classInt on a random 1% subset of the data points?