
My group and I are working on a high-dimensional dataset with a mix of categorical (binary and integer-coded) and continuous variables. We would like to know which distance metric and linkage method are best suited to agglomerative hierarchical clustering on data like this. We started with Euclidean distance and Ward's linkage, but because Euclidean distance is poorly suited to categorical variables, we need a different strategy. We have since tried the Heterogeneous Euclidean-Overlap Metric (HEOM) and Gower's distance with average, centroid, and single linkage, but have not gotten the clear results we were hoping for. Are there better methods or metrics that we should use for our analysis?

Here is an example of the code we have already:

from distython import HEOM
categorical_ix = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 28, 34, 37, 39, 142, 41, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 59, 60, 61, 62, 63, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171,  172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 213, 217, 218, 219, 220, 221, 222, 223, 224, 225]

nan_eqv = 12345

heom_metric = HEOM(features, categorical_ix, nan_equivalents = [nan_eqv])

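# Newer scikit-learn releases expose this class as sklearn.metrics.DistanceMetric.
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric(heom_metric.heom)
# Square (n_samples x n_samples) matrix of pairwise HEOM distances.
distance = dist.pairwise(features)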

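import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# linkage() treats a 2-D array as raw observations, not as distances, so the
# square matrix is converted to condensed form first (checks=False skips the
# symmetry check, which floating-point round-off can trip).
condensed = squareform(distance, checks=False)
linkage_matrix = linkage(condensed, 'average')
plt.figure(figsize=(10, 7))
plt.title("Test")
dendrogram(linkage_matrix)
plt.axhline(y=8, color='r', linestyle='--')
plt.show()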

from scipy.cluster.hierarchy import fcluster
k = 4
clusters = fcluster(linkage_matrix, k, criterion='maxclust')
clusters
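
One possible way to compare linkage methods on a precomputed distance matrix is the cophenetic correlation coefficient, which measures how faithfully a dendrogram preserves the original pairwise distances. The sketch below is only an illustration under stated assumptions: the data (`toy`) is synthetic, city-block distance stands in for a Gower or HEOM matrix, and the names `toy`, `condensed`, and `scores` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Synthetic stand-in data (hypothetical): two well-separated groups.
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0.0, 1.0, size=(20, 5)),
                 rng.normal(6.0, 1.0, size=(20, 5))])

# Condensed pairwise distances. City-block is only a placeholder here for a
# precomputed Gower or HEOM matrix converted with squareform().
condensed = pdist(toy, metric="cityblock")

scores = {}
for method in ("single", "average", "complete"):
    Z = linkage(condensed, method=method)
    # With the original distances supplied, cophenet returns
    # (correlation, cophenetic distances).
    coph_corr, _ = cophenet(Z, condensed)
    scores[method] = float(coph_corr)

for method, score in scores.items():
    print(f"{method:>8}: {score:.3f}")
```

Centroid, median, and Ward linkage are left out of the loop on purpose: SciPy documents them as correctly defined only for Euclidean pairwise distances, so they may not be meaningful with Gower or HEOM.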

If Gower's distance or HEOM is the preferred method, we would also appreciate any advice on how to better implement these metrics in our code. Thank you.
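
For reference, Gower's distance can be sketched directly in NumPy: each categorical feature contributes 0 for a match and 1 for a mismatch, each numeric feature contributes the absolute difference divided by that feature's range, and the contributions are averaged over features. The function name `gower_matrix`, the toy array, and the assumption of equal feature weights and no missing values are all illustrative choices, not taken from any particular library.

```python
import numpy as np

def gower_matrix(X, categorical_ix):
    """Pairwise Gower distances for a 2-D numeric array with no missing values.

    Hypothetical helper: equal feature weights, categorical columns given
    as integer codes.
    """
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    is_cat = np.zeros(n_features, dtype=bool)
    is_cat[list(categorical_ix)] = True

    total = np.zeros((n_samples, n_samples))
    for j in range(n_features):
        col = X[:, j]
        if is_cat[j]:
            # Categorical: 0 if the codes match, 1 otherwise.
            d = (col[:, None] != col[None, :]).astype(float)
        else:
            # Numeric: absolute difference scaled by the column range.
            col_range = col.max() - col.min()
            if col_range == 0:
                d = np.zeros((n_samples, n_samples))
            else:
                d = np.abs(col[:, None] - col[None, :]) / col_range
        total += d
    # Average the per-feature contributions.
    return total / n_features

# Toy data: columns 0 and 1 are categorical codes, column 2 is continuous.
toy = np.array([[1.0, 0, 10.0],
                [1.0, 1, 20.0],
                [0.0, 1, 30.0]])
D = gower_matrix(toy, categorical_ix=[0, 1])
print(np.round(D, 3))
```

The resulting square matrix could then be converted with `scipy.spatial.distance.squareform` and passed to `linkage`, in the same way as the HEOM matrix.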

cornell_ML