
The plot below shows my t-SNE result. I can show the plot here, but unfortunately I can't share the actual labels. There are 4 different labels:

[t-SNE plot of the patient samples, colored by the 4 labels]

The plot was created from a data frame called scores, which contains approximately 1100 patient samples (rows) and 25 features (columns). The labels for the plot come from a separate data frame called metadata. The following code generates the plot using both scores and metadata:

library(Rtsne)
library(ggplot2)

tsneres <- Rtsne(as.matrix(scores), dims = 2, perplexity = 6)  # Rtsne expects a matrix
tsne_df <- as.data.frame(tsneres$Y)  # 2-D embedding coordinates
ggplot(tsne_df, aes(x = V1, y = V2, color = metadata$labels)) +
  geom_point()

My mission:

I want to analyze the t-SNE plot and identify which features (columns of the "scores" matrix) are most prevalent in each cluster. Specifically, I want to understand which features are most helpful in distinguishing between the different clusters in the plot. Is it possible to use an alternative algorithm, such as PCA, that preserves the distances between data points in order to accomplish this task? Perhaps it's even a better choice than t-SNE?

This is an example of scores. It's not the real data, but it's similar:

structure(list(Feature1 = c(0.1, 0.3, -0.2, -0.12, 0.17, -0.4, 
-0.21, -0.19, -0.69, 0.69), Feature2 = c(0.22, 0.42, 0.1, -0.83, 
0.75, -0.34, -0.25, -0.78, -0.68, 0.55), Feature3 = c(0.73, -0.2, 
0.8, -0.48, 0.56, -0.21, -0.26, -0.78, -0.67, 0.4), Feature4 = c(0.34, 
0.5, 0.9, -0.27, 0.64, -0.11, -0.41, -0.82, -0.4, -0.23), Feature5 = c(0.45, 
0.33, 0.9, 0.73, 0.65, -0.1, -0.28, -0.78, -0.633, 0.32)), class = "data.frame", row.names = c("Patient_A", 
"Patient_B", "Patient_C", "Patient_D", "Patient_E", "Patient_F", 
"Patient_G", "Patient_H", "Patient_I", "Patient_J"))

EDIT - PYTHON

I got to the same point in Python. I tried PCA first, but it produced very poor plots, so I reduced dimensions with t-SNE instead, which produced much better results, and then clustered the data using k-means. My question is the same as before, except that now I don't mind using either R or Python.

This is the new plot:

[t-SNE plot of the scores data, colored by label, with annotated k-means cluster centers]

And this is the code:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# reduce the scores data to 2 dimensions with t-SNE
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200)
tsne_result = tsne.fit_transform(scores)

# cluster the embedding with k-means and keep the cluster centers
# (the k-means call wasn't shown in the original snippet; n_clusters=4 is assumed)
kmeans = KMeans(n_clusters=4).fit(tsne_result)
cluster_centers = kmeans.cluster_centers_

# create a dict to map the labels to colors
label_color_dict = {'label1': 'blue', 'label2': 'red', 'label3': 'yellow', 'label4': 'green'}

# create a list of colors based on the 'labels' column in metadata
colors = [label_color_dict[label] for label in metadata['labels']]

plt.scatter(tsne_result[:, 0], tsne_result[:, 1], c=colors, s=50)
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], c='red', marker='o')

# add labels to the cluster centers
for i, center in enumerate(cluster_centers, 1):
    plt.annotate(f"Cluster {i}", (center[0], center[1]),
                 textcoords="offset points",
                 xytext=(0, 10), ha='center', fontsize=20)
plt.show()
Programming Noob
  • If you want to preserve distances, I would say the best choice is multidimensional scaling: https://en.wikipedia.org/wiki/Multidimensional_scaling – mastropi Jan 22 '23 at 10:33
  • @mastropi I'm not sure I need it... My main question is how to identify which columns are most prevalent in each cluster of the t-SNE plot. – Programming Noob Jan 22 '23 at 11:56
  • Yeah, I understood that, I was just pointing out that instead of PCA I thought multidimensional scaling would be more appropriate to preserve distances (i.e. I was just commenting on your sentence "Is it possible to use an alternative algorithm, such as PCA, that preserves distances between data points...?"). Apart from that, I am not able to help you with the question, at least not at this time as I don't have time to investigate further. But for sure, your question is quite interesting! :-) – mastropi Jan 23 '23 at 13:19
  • As you asked: I discourage you from using t-SNE for feature analysis. t-SNE is amazing for visualization and for revealing that patterns and clusters exist in the data. But t-SNE has some problems: it is affected by randomness, other hyperparameter settings, and the scaling of your data. Depending on the perplexity, clusters can break up or form. The goal is to find an embedding where similar values are close to each other; on the other hand, points that are further apart are not necessarily very dissimilar, which means distances in the embedding are not a good indicator, and you can't reverse it back to the original space and features. – Daraan Jan 27 '23 at 14:16
  • Are the "clusters" you're talking about the 4 labels you mention or the groups of points in your t-SNE plot? – m13op22 Jan 27 '23 at 16:48

2 Answers


t-SNE is a great way to visualize data, but it is not good for obtaining a reduced feature space. And even if you do dimensionality reduction effectively (e.g. using PCA with n=3) and obtain new features F1, F2, and F3, it is not easy to find which original features contributed to the differentiation between the different clusters.
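As a side note (not part of the original answer), here is a minimal sketch of what PCA exposes directly, namely per-component loadings rather than per-cluster contributions; it assumes the scores data frame from the question and reuses the component names F1-F3 mentioned above:

import pandas as pd
from sklearn.decomposition import PCA

# fit PCA on the numeric scores data frame (patients x features)
pca = PCA(n_components=3)
pca.fit(scores)

# loadings: weight of each original feature in each component
loadings = pd.DataFrame(pca.components_.T,
                        index=scores.columns,
                        columns=['F1', 'F2', 'F3'])
print(loadings)
print(pca.explained_variance_ratio_)  # variance captured per component

The loadings tell you which features drive each component, not which features separate specific clusters, which is why the rest of this answer switches to a silhouette-based measure.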

I agree with @MotaBtw that the silhouette score is a good way to measure feature importance, but I will try to explain it in the context of your use case. By definition, the silhouette score evaluates a clustering run by calculating the difference between the mean inter-cluster distance and the mean intra-cluster distance for each sample. The larger this difference, the better the clustering. See this detailed image.
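For reference, the standard silhouette score described above can be computed directly with scikit-learn; a minimal sketch, assuming the scores data from the question and cluster labels from a k-means run (k=4 is just an assumption):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# cluster labels from any clustering run; k=4 is an assumed value
labels = KMeans(n_clusters=4).fit_predict(scores)

print(silhouette_score(scores, labels))        # mean score over all samples
print(silhouette_samples(scores, labels)[:5])  # per-sample scores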

We can use the same logic a little differently: we want to find the contribution of each feature to the silhouette score. The larger a feature's contribution, the more important it is.

I created a working algorithm and added the code to a GitHub repo, since it was a little too lengthy to include here: https://github.com/vsablok123/silhouette_feature_importance/

import numpy as np

# _intra_cluster_distance and _nearest_cluster_distance are helper functions
# defined in the linked GitHub repo

def silhouette_feature_importance(X, labels):
    """
    The Silhouette Coefficient is calculated using the mean intra-cluster
    distance (a) and the mean nearest-cluster distance (b) for each sample.
    The Silhouette Coefficient for a sample is ``(b - a) / max(a, b)``.
    To clarify, b is the distance between a sample and the nearest cluster
    that the sample is not a part of.
    The feature importance is inferred by looking at the features which
    contribute the most to the silhouette coefficient.

    Parameters
    ----------
    X : array [n_samples_a, n_features]
        Feature array.
    labels : array, shape = [n_samples]
        Label values for each sample.

    Returns
    -------
    silhouette : array, shape = [n_features]
        Feature importance for each feature.
    """
    n = labels.shape[0]
    A = np.array([_intra_cluster_distance(X, labels, i)
                  for i in range(n)])
    B = np.array([_nearest_cluster_distance(X, labels, i)
                  for i in range(n)])
    print(f"A shape = {A.shape}")
    print(f"B shape = {B.shape}")
    sil_samples = abs(B - A)
    # nan values are for clusters of size 1, and should be 0
    return np.mean(np.nan_to_num(sil_samples), axis=0)

Steps to take for your use case (a rough sketch in code follows this list):

  1. Find the highly correlated features in 'scores' and remove them before you run the clustering.
  2. Run clustering using any algorithm, but the clustering should be optimized. For example, in the case of k-means, find the ideal n_clusters using the elbow method.
  3. Now pass the clustering output to the silhouette_feature_importance algorithm to get the top features.
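
A minimal sketch of those three steps in Python, under stated assumptions: the correlation threshold (0.9), the candidate cluster counts, and the final k=4 are illustrative only, and silhouette_feature_importance is the function from the repo above:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# 1. drop one feature from each highly correlated pair (threshold is arbitrary)
corr = scores.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
scores_reduced = scores.drop(columns=to_drop)

# 2. pick n_clusters with the elbow method (inspect the inertia curve)
inertias = {k: KMeans(n_clusters=k, n_init=10).fit(scores_reduced).inertia_
            for k in range(2, 10)}
print(inertias)  # choose k at the "elbow" of this curve

# 3. cluster with the chosen k and rank features by their silhouette contribution
labels = KMeans(n_clusters=4, n_init=10).fit_predict(scores_reduced)
importances = silhouette_feature_importance(scores_reduced.values, labels)
print(pd.Series(importances, index=scores_reduced.columns).sort_values(ascending=False))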

Let me know if there is an issue in understanding the code.


Just as food for thought: how about you cluster the t-SNE dimensions into the clusters you see (e.g. 5 clusters)? Afterwards, you merge that information back into the original data frame containing the original variables and train a simple classification algorithm (e.g. CatBoost, to keep it simple and reasonably well performing). The target of the classification algorithm would be to predict the clusters.

Lastly, you can then use explainable-AI approaches, like Shapley values or anchors, to explain the decisions of your model.
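A minimal sketch of that idea, with a few assumptions: the t-SNE embedding tsne_result and the scores data frame come from the question, the cluster count (5) is illustrative, and a scikit-learn random forest plus the shap package stand in here for CatBoost and the Shapley-value step:

import shap
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# 1. cluster the 2-D t-SNE embedding into the clusters seen in the plot
clusters = KMeans(n_clusters=5).fit_predict(tsne_result)

# 2. train a classifier on the ORIGINAL features to predict those clusters
model = RandomForestClassifier(n_estimators=500).fit(scores, clusters)
print(model.score(scores, clusters))  # sanity check: should be reasonably high

# 3. explain the model with Shapley values to see which features drive each cluster
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(scores)
shap.summary_plot(shap_values, scores)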

Provided that your model reaches reasonable performance, this gives you a list of drivers that relate to certain clusters.

If you need help with the code, let me know. Hope that helps.

Lukas Hestermeyer