
I have a set of 10,000 points, each made up of 70 boolean dimensions. From this set of 10,000, I would like to select 100 points which are representative of the whole set of 10,000. In other words, I would like to pick the 100 points which are most different from one another.

Is there some established way of doing this? The first thing that comes to my mind is a greedy algorithm, which begins by selecting one point at random; the next point is then selected as the most distant one from the first point, the third point is selected as having the longest average distance from the first two, and so on. This solution doesn't need to be perfect, just roughly correct. Preferably, the 100 points can be found within ~10 minutes, but finishing within 24 hours is also fine.
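Something like this rough, untested sketch is what I have in mind (just NumPy/SciPy; `points` would be my 10,000 x 70 boolean array, and the names are placeholders):

```python
import numpy as np
from scipy.spatial.distance import cdist

def greedy_diverse_subset(points, k=100, seed=0):
    """Greedily pick k points, each maximising the average distance to those already picked."""
    rng = np.random.default_rng(seed)
    n = len(points)
    first = int(rng.integers(n))
    chosen = [first]
    # running sum of Hamming distances from every point to the chosen set
    dist_sum = cdist(points, points[[first]], metric="hamming").ravel()
    for _ in range(k - 1):
        candidates = dist_sum.copy()
        candidates[chosen] = -np.inf          # never re-pick an already chosen point
        nxt = int(np.argmax(candidates))      # largest average distance to the chosen set
        chosen.append(nxt)
        dist_sum += cdist(points, points[[nxt]], metric="hamming").ravel()
    return points[chosen]
```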

I don't care about distance in particular; that's just something that comes to mind as a way to capture "differentness."

If it matters, every point has 10 values of TRUE and 60 values of FALSE.

Some already-built Python package to do this would be ideal, but I am also happy to just write the code myself if somebody could point me to a Wikipedia article.

Thanks

  • What purpose is your "100 most different" points being used for? Each purpose will imply its own practical definitions of "different", and this will lead to a choice of algorithm. This may be a good fit for a data science question too, but only if you want to explore the theory as opposed to getting something that sort of works. – Neil Slater Sep 05 '20 at 18:34
  • I wouldn't say that the 100 "most representative" points are the same as the 100 "most different" points. – John Gordon Sep 05 '20 at 18:35
  • Have a look at this [related question](https://stackoverflow.com/questions/62576860/how-to-get-the-k-most-distant-points-given-their-coordinates?rq=1). – Axel Kemper Sep 05 '20 at 20:01
  • I didn't want to get into distracting details that may prompt some readers to roll their eyes, but I am generating fantasy football lineups for a daily fantasy sports contest. My model will generate 10,000 good lineups of different players (a lineup is a combination of 10 players selected out of a pool of 70). I then want to identify the 100 most distinct lineups. By selecting widely varying lineups I increase my odds of one of them being a winner. Thanks for the link Axel, this is helpful. – user9154908 Sep 05 '20 at 21:49

2 Answers


Your use of "representative" is not standard terminology, but I read your question as meaning that you wish to find 100 items covering a wide gamut of different examples from your dataset. So if 5,000 of your 10,000 items were near-identical, you would prefer to see only one or two items from that large sub-group. Under the usual definition, a representative sample of 100 would instead contain ~50 items from that group.

One approach that might match your stated goal is to identify diverse subsets or groups within your data, and then pick an example from each group.

You can establish group identities for a fixed number of groups - with different membership size allowed for each group - within a dataset using a clustering algorithm. A good option for you might be k-means clustering with k=100. This will find 100 groups within your data and assign all 10,000 items to one of those 100 groups, based on a simple distance metric. You can then either take the central point from each group or a random sample from each group to find your set of 100.

The k-means algorithm is based around minimising a cost function which is the average distance of each group member from the centre of its group. Both the group centres and the membership are allowed to change, updated in an alternating fashion, until the cost cannot be reduced any further.

Typically you start by assigning each item randomly to a group. Then calculate the centre of each group. Then re-assign items to groups based on the closest centre. Then recalculate the centres, and so on. Eventually this should converge. Multiple runs might be required to find a good set of centres (it can get stuck in a local optimum).
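For illustration, a minimal NumPy sketch of those alternating updates might look like this (assuming your data is in a (10000, 70) array `X`; untested, just to show the shape of the algorithm):

```python
import numpy as np

def simple_kmeans(X, k=100, n_iter=50, seed=0):
    X = np.asarray(X, dtype=float)            # boolean input is fine, cast to float
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each item joins the group with the nearest centre
        d2 = (X ** 2).sum(1)[:, None] - 2 * X @ centres.T + (centres ** 2).sum(1)[None, :]
        labels = d2.argmin(axis=1)
        # update step: each centre moves to the mean of its current members
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):  # cost can no longer be reduced
            break
        centres = new_centres
    return labels, centres
```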

There are several implementations of this algorithm in Python. You could start with the scikit-learn library implementation.
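A hedged sketch of that approach with scikit-learn (again assuming a boolean array `points`; the variable names are just illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

X = points.astype(float)                              # k-means expects numeric data
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)

# Option 1: take the real data point closest to each cluster centre
closest_idx = pairwise_distances_argmin(km.cluster_centers_, X)
representatives = points[closest_idx]

# Option 2: take one random member from each cluster
rng = np.random.default_rng(0)
sampled_idx = [int(rng.choice(np.flatnonzero(km.labels_ == j))) for j in range(100)]
representatives_sampled = points[sampled_idx]
```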

According to an IBM support page (from the comment by sascha), k-means may not work well with binary data. Other clustering algorithms may work better. You could also try to convert your records to a space where Euclidean distance is more useful and continue to use k-means clustering. An algorithm that may do that for you is principal component analysis (PCA), which is also implemented in scikit-learn.
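If you go that route, the pipeline could look roughly like this (the number of components is an arbitrary placeholder you would need to tune):

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# project the 70 boolean dimensions onto a smaller continuous space, then cluster there
Z = PCA(n_components=20).fit_transform(points.astype(float))
labels = KMeans(n_clusters=100, n_init=10, random_state=0).fit_predict(Z)
```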

Neil Slater
  • With such a sparse problem-description, this low-friction approach might work, but in general I would be scared of doing k-means with non-euclidean metrics ([Clustering binary data with K-Means (should be avoided)](https://www.ibm.com/support/pages/clustering-binary-data-k-means-should-be-avoided)) and also of the post-cluster step. If k-means was used to *compress the decision-space*, I would exploit this by switching to some more *global* approaches in a final *polishing step*, e.g. exact integer-programming (with a potential time-limit) with e.g. 5 candidates per cluster. – sascha Sep 05 '20 at 21:30
  • Ah, thanks for clarifying that point about what "representative" would mean. Your "wide gamut" description is more in line with my goals. I like this solution that you've posted. Nice and simple. Thanks. Consistent with Sascha's recommendations, I may explore different clustering algorithms if k-means or other easy-to-implement ones don't produce results that pass an eye-test. (Other parts of my project are far from perfect, so this clustering being less than ideal is fine). – user9154908 Sep 05 '20 at 22:02
  • @sascha: Thanks for that note. From the link it appears that one simple way to improve results, if they are not satisfactory, might be to use principal component analysis to map the data into a space where k-means works better. – Neil Slater Sep 06 '20 at 08:50

The graph partitioning tool METIS claims to be able to partition graphs with millions of vertices into 256 parts within seconds.

You could treat your 10,000 points as vertices of an undirected graph. A fully connected graph would have about 50 million edges, which would probably be too big. Therefore, you could restrict the edges to "similarity links" between points whose Hamming distance is below a certain threshold.

In general, Hamming distances for 70-bit words have values between 0 and 70. In your case, the upper limit is 20, as there are 10 TRUE and 60 FALSE coordinates per point. The maximum distance occurs when the two points share no TRUE coordinate.

Creating the graph is a costly O(n^2) operation. But it might be possible to get it done within your envisaged time frame.
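A rough sketch of that construction, assuming `points` is the (10000, 70) boolean array and the `pymetis` binding to METIS is available (the threshold of 8 differing bits is an arbitrary placeholder):

```python
import numpy as np
import pymetis  # Python binding for METIS, assumed to be installed

n = len(points)
threshold = 8                                  # "similarity link" cut-off in differing bits (max is 20 here)
adjacency = []
for i in range(n):                             # O(n^2) overall, but only O(n) memory per row
    d_bits = (points ^ points[i]).sum(axis=1)  # exact Hamming distance in bits
    neighbours = np.flatnonzero(d_bits <= threshold)
    adjacency.append(neighbours[neighbours != i])

# partition the similarity graph into 100 parts; picking one point per part gives the 100
_, membership = pymetis.part_graph(100, adjacency=adjacency)
```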

Axel Kemper