Pick m points per cluster

Question

I have 100m pairs of this form:

(point_index, cluster_index)

The goal is to select m points for every cluster (the first m? It doesn't matter which). There are at most 16k clusters. How can I do this efficiently?

m will be small, <= 100.

My first attempt:

Sort the pairs by cluster_index.
Linearly traverse the pairs; if fewer than m points from the current cluster have been selected, print the point, otherwise do nothing until the next cluster is found.

That would yield O(n log n) time complexity, where n = 100m. However, note that I am interested in the actual application only, and not in a lower asymptotic bound that comes with a huge constant, for example! The algorithm will be executed in JavaScript on laptops.
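
The sort-and-scan attempt described above could look roughly like this in JavaScript. This is only a sketch: the input shape (an array of `[pointIndex, clusterIndex]` pairs) and the name `pickBySorting` are assumptions, since the question does not fix a data structure.

```javascript
// Hypothetical input shape: an array of [pointIndex, clusterIndex] pairs.
function pickBySorting(pairs, m) {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = pairs.slice().sort((a, b) => a[1] - b[1]);
  const picked = [];
  let currentCluster = null; // cluster of the run being scanned
  let taken = 0;             // points already taken from that cluster
  for (const [point, cluster] of sorted) {
    if (cluster !== currentCluster) {
      // A new run of pairs begins: reset the per-cluster counter.
      currentCluster = cluster;
      taken = 0;
    }
    if (taken < m) {
      picked.push([point, cluster]);
      taken += 1;
    }
  }
  return picked;
}
```

The sort dominates the cost; the scan itself is a single linear pass.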

What is the actual JS data structure? You've used parentheses for your pair, but is each pair an array, and then you have an array of pair arrays, or...? — nnnnnn, Sep 06 '16 at 04:40
Storing it in `result` might be more beneficial: just iterate over the collection and fill it. — zerkms, Sep 06 '16 at 04:40
@nnnnnn I haven't decided that yet! The data are still baking... I am using a sample, so I didn't really think about that, but I should! zerkms, to be honest, I haven't played around with [tag:javascript] and data structures. I will look into that, but if you have some time to provide an answer with an example, that would be welcome! :) — gsamaras, Sep 06 '16 at 04:43
@gsamaras just an object, with key = cluster index and values = array of points. — zerkms, Sep 06 '16 at 04:44
In JS the standard does not put any constraint on how exactly it's implemented, but it would be fair to expect it to be something like a hash table indeed. So, just a JS object, `{}` — zerkms, Sep 06 '16 at 04:47
I would have to do some research on that @zerkms, since I can't tell if there is something ready (which I would infer from your last comment), or I should build something myself (which I would infer from your 1st comment). That's why I asked for a toy example. Must sleep, thank you! Or, if you are not in the mood for an example, an answer explaining your reasoning would also be nice! — gsamaras, Sep 06 '16 at 04:49
Are the clusters of similar size? If so, you could consider taking points at random: you should expect to have m of each after taking m*c*log(m*c) + 0.6 * (m*c) points. If m is not too big, that is not too bad. — user1470500, Sep 06 '16 at 04:59
@user1470500 I hope so, the [Unbalanced factor](http://stackoverflow.com/questions/39235576/unbalanced-factor-of-kmeans) is <2. Sounds cool, please post an answer explaining exactly how the algorithm would look and I will check tomorrow! — gsamaras, Sep 06 '16 at 05:02
@user1470500 you would have to check whether the point was not already picked. That's another linear scan. In any case, for an unordered collection, how picking random items is better than just iterating from the first to last? — zerkms, Sep 06 '16 at 05:02
@zerkms It will not change anything except not always choosing the same points. Good point about the check, I forgot about it, but I am wondering if there isn't a way of doing a pseudo-random sequence without repetitions and without having to check? Also my formula was wrong, the expectation would be c*log(c) + m * c * log(log(c)) + O(c) points before having m of each, but I need to check how many points are needed for a 99.9% chance of having m of each, to see if it is usable. — user1470500, Sep 06 '16 at 05:17
Plus I forget the condition m << n / c so my idea may be really dumb in fact. — user1470500, Sep 06 '16 at 05:25
@user1470500 "if there isn't a way of doing a pseudo-random sequence without repetitions and without having to check?" --- there is: take a prime number (P) larger than the number of elements in the array. Pick random first item. Then increment by the chosen P treating an array as a ring (or just using `%` to handle the overlaps). Not sure if this algorithm has a proper name, I've read it somewhere years ago. Quick and dirty implementation in js https://jsfiddle.net/3p73cqng/ — zerkms, Sep 06 '16 at 05:34
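
The single-pass idea from the comments (a plain object keyed by cluster index, filled while iterating) might be sketched as follows. The pair layout `[pointIndex, clusterIndex]` and the name `pickInOnePass` are assumptions made for illustration, not something the thread specifies.

```javascript
// Hypothetical input shape: an array of [pointIndex, clusterIndex] pairs.
function pickInOnePass(pairs, m) {
  const result = {}; // cluster index -> array of up to m point indices
  for (const [point, cluster] of pairs) {
    // Create the bucket the first time a cluster is seen.
    const bucket = result[cluster] || (result[cluster] = []);
    // Keep the point only while the cluster still has room.
    if (bucket.length < m) bucket.push(point);
  }
  return result;
}
```

This visits every pair once, needs no sort, and stores at most c*m point indices in total.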

score 1 · Accepted Answer · answered Sep 06 '16 at 06:22

A solution with the following hypotheses:

No specific data-structure, just a list of points with clusters
Cluster sizes are balanced
m << n / c where n is the number of points, and c the number of clusters

Under these hypotheses, taking points at random could give quick results. To walk the points in a shuffled order without repetitions, you can use @zerkms's algorithm.

Take a prime p > n.

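const clusterCount = new Array(c).fill(0); // points taken so far, per cluster
let i = Math.floor(Math.random() * n);     // random start index in [0, n)
let complete = 0;                          // points taken so far, in total
// Assumes every cluster holds at least m points; otherwise this never ends.
while (complete < c * m) {
  const cluster = points[i].cluster;
  if (clusterCount[cluster] < m) {
    clusterCount[cluster] += 1;
    plot(points[i]);
    complete += 1;
  }
  i = (i + p) % n; // p is coprime with n, so no index repeats within n steps
}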
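
For a version of the loop above that can be run on its own, here is a minimal sketch. It assumes `points` is an array of objects with a `cluster` field in the range 0 to c - 1, as in the snippet above; the helpers `isPrime`, `nextPrimeAbove`, and `pickByStride` are hypothetical names, and the points are collected into an array instead of being plotted. The loop is capped at n steps so that it also stops when some cluster has fewer than m points.

```javascript
// Trial division is enough here: it runs only a few times, on numbers near n.
function isPrime(x) {
  if (x < 2) return false;
  for (let d = 2; d * d <= x; d++) {
    if (x % d === 0) return false;
  }
  return true;
}

// Smallest prime strictly greater than n (hypothetical helper).
function nextPrimeAbove(n) {
  let p = n + 1;
  while (!isPrime(p)) p++;
  return p;
}

function pickByStride(points, c, m) {
  const n = points.length;
  const p = nextPrimeAbove(n);               // prime > n, hence coprime with n
  const clusterCount = new Array(c).fill(0); // points taken so far, per cluster
  const picked = [];
  let i = Math.floor(Math.random() * n);     // random start index in [0, n)
  // n steps visit every index exactly once, so the cap guarantees termination.
  for (let step = 0; step < n && picked.length < c * m; step++) {
    const point = points[i];
    if (clusterCount[point.cluster] < m) {
      clusterCount[point.cluster] += 1;
      picked.push(point);
    }
    i = (i + p) % n;
  }
  return picked;
}
```

Note that this is a fixed-stride walk from a random start, not a uniformly random permutation. The effective step is p % n, so when the chosen prime is only slightly larger than n the walk is close to a sequential scan; picking a prime whose remainder modulo n is not small should spread the visits more.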

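On average, this method will require about c*log(c) + (m - 1) * c * log(log(c)) + O(c) iterations. This is the classical estimate for collecting m copies of each of c equally likely items, valid for fixed m and large c.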
