I have 100m pairs of that form:
(point_index, cluster_index)
The goal is to select (the first? It doesn't matter) m
points for every cluster. The clusters are 16k in number, at max. How to do this efficiently?
m
shall be small, <=100.
My first attempt:
- Sort the pairs by
cluster_index
. - Linearly traverse the pairs and if less than
m
points from the current cluster are selected, then print point, else do nothing until next cluster is found.
That would yield a:
O(nlogn)
time complexity, where n = 100m. However notice that I am interested in the actual application only, and not for a lower bound with a huge constant for example! The algorithm will be executed in javascript via laptops.