
I've referenced this answer and had a lot of success with it, as the output it produces seems reasonably accurate. I've also changed it slightly so that I can specify the number of clusters desired, like so:

import pandas as pd
import scipy.cluster.hierarchy as spc
from scipy.spatial.distance import pdist

df = pd.DataFrame(A)
corr = df.corr().values  # correlation matrix of the columns of A

cluster_count = 50
dist = pdist(corr)  # condensed pairwise distances between rows of the correlation matrix
linkage = spc.linkage(dist, method='complete')
idx = spc.fcluster(linkage, cluster_count, 'maxclust')  # at most 50 flat clusters

My question is whether it's possible to modify this code so that items can belong to multiple clusters when they align closely with two or more of them.

That is, while most items may belong to only one cluster, an item that is extremely highly correlated with more than one cluster should be present in each of them.

Currently, I find it a bit unfortunate that some items have to "choose" a single cluster to be part of when they align highly with several.
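One way to get this overlapping behaviour while keeping the hierarchical clustering is a post-processing pass: after `fcluster` assigns each item a hard label, also attach an item to any other cluster whose members it correlates with strongly on average. This is a sketch, not a built-in scipy feature; the toy data standing in for `A`, the small `cluster_count`, and the 0.8 cutoff are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as spc
from scipy.spatial.distance import pdist

# Toy stand-in for the real data `A`: 200 observations of 40 variables.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 40))

corr = pd.DataFrame(A).corr().values
cluster_count = 5  # small value for the toy example

linkage = spc.linkage(pdist(corr), method='complete')
idx = spc.fcluster(linkage, cluster_count, 'maxclust')  # hard labels 1..cluster_count

threshold = 0.8  # hypothetical cutoff for "extremely highly correlated"
clusters = {}
for k in range(1, cluster_count + 1):
    hard_members = np.flatnonzero(idx == k)
    members = set(hard_members)  # every item keeps its hard assignment
    for i in range(corr.shape[0]):
        others = hard_members[hard_members != i]
        # Attach item i as an extra member if its mean correlation
        # with this cluster's members clears the cutoff.
        if others.size and corr[i, others].mean() >= threshold:
            members.add(i)
    clusters[k] = sorted(members)
```

Because the hard assignment is always kept, every item appears in at least one cluster; items only gain extra memberships, never lose their original one.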

Ryan Peschel
  • Do you have to use `spc.fcluster` (hierarchical clustering)? If not, you can use k-means and then decide what clusters each item belongs to, by looking at the distance from each centroid. – Dvir Cohen Jan 27 '20 at 13:38
  • I'm pretty new to stats, so I was just trying random methods. That seems like a pretty good idea though! I'll try it out – Ryan Peschel Jan 27 '20 at 14:18
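The centroid-distance idea from the comment could be sketched with scikit-learn's `KMeans`, whose `transform` method returns each item's distance to every centroid. An item is then kept in every cluster whose centroid is nearly as close as the nearest one; the 1.1 tolerance factor and the toy data are arbitrary illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the real data `A`.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 40))
corr = pd.DataFrame(A).corr().values

cluster_count = 5
km = KMeans(n_clusters=cluster_count, n_init=10, random_state=0).fit(corr)

# Distance from every item (row of corr) to every centroid.
dists = km.transform(corr)  # shape (n_items, cluster_count)

# Keep every cluster whose centroid is within 10% of the nearest one.
tol = 1.1  # arbitrary tolerance for "highly aligned with multiple clusters"
nearest = dists.min(axis=1, keepdims=True)
membership = dists <= tol * nearest  # boolean matrix: item i belongs to cluster j
```

Each row of `membership` always contains the item's own k-means cluster (its nearest centroid), and may contain additional clusters when two centroids are nearly equidistant.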

0 Answers