1

I have a mixed data set (numerical and categorical features) and am trying unsupervised clustering on it.

I am running many different experiments, including Gower's distance and k-prototypes. I would also like to try some of the sklearn metrics to see what values they give.

While looking at silhouette_score, I noticed the 'metric' argument, which lets me choose how distances are computed. Since my data has mixed types, I would like to use manhattan for the numerical features and hamming for the categorical ones. Is there a way to use silhouette_score with both metrics in one go? If all my input data were numerical, I would have done the following:

silhouette_score(friendRecomennderData, labels, metric = 'manhattan')

Thank you in advance.

S. Jay

2 Answers

0

You are confusing the arguments that are passed to silhouette_score. The documentation says the following about the input data, i.e. the parameter X:

X: array [n_samples_a, n_samples_a] if metric == “precomputed”, or, [n_samples_a, n_features] otherwise. Array of pairwise distances between samples, or a feature array.

Thus the data can only be a numerical array: either a feature array or an array of pairwise distances between the samples. It's not possible to have distances as categorical values.

You need to first cluster your data, then compute the distance matrix and provide it as the input X to silhouette_score with metric = 'precomputed'.
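As a minimal sketch of that workflow, assuming purely numerical data for now: the array `X` below is made-up toy data and KMeans is only a stand-in for whatever clustering method is actually used. It shows that passing a precomputed Manhattan distance matrix gives the same score as passing the features with metric = 'manhattan'.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances, silhouette_score

# Hypothetical numeric data: two well-separated blobs (illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(8, 1, (20, 3))])

# Step 1: cluster the data (KMeans is just a placeholder choice here)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: build the full pairwise distance matrix, shape (n_samples, n_samples)
D = pairwise_distances(X, metric="manhattan")

# Step 3: hand the matrix to silhouette_score as precomputed distances
score_precomputed = silhouette_score(D, labels, metric="precomputed")

# For comparison: let silhouette_score compute the distances itself
score_direct = silhouette_score(X, labels, metric="manhattan")
```

The benefit of the precomputed route is that `D` can come from any dissimilarity you can compute yourself, not only from the metrics built into sklearn.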

Gambit1614
  • 1
    Thank you for your feedback. But I don't see from the passage you quoted why it can't be done with categorical data. If I can measure categorical dissimilarity and numerical distance and combine them in a meaningful way (that is the fundamental question in my post), it should work. If categorical data couldn't be used, there would be no reason for distance metrics such as Hamming or Jaccard to exist, would there? – S. Jay Aug 26 '20 at 02:28
  • @HJSeo Let me explain it the other way around. How do you define the distance between two strings, say `alakazam` and `jigglypuff` ? – Gambit1614 Aug 26 '20 at 07:33
  • 1
    At the word level, I could count frequencies and use the Hamming metric, which is what I would like to do now. At the character level, I would perhaps use levenshtein_distance. This is exactly why I want to use different metrics such as the Gower metric. – S. Jay Aug 27 '20 at 02:06
0

You can use a distance metric such as Gower's distance, which handles mixed data types, and then pass the computed distance matrix as X with metric = 'precomputed' to the silhouette_score function.
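A rough sketch of this idea, under several assumptions: the toy arrays `X_num`, `X_cat` and `labels` are made up, the categorical columns are assumed to be integer-encoded already, and the weighting (each feature counts equally) is one possible choice rather than the only one. Range-scaled Manhattan distance on the numerical columns averaged with Hamming mismatch on the categorical columns is essentially how Gower's distance is usually defined.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_score

# Hypothetical mixed data: 2 numerical columns, 2 integer-encoded categorical columns
X_num = np.array([[1.0, 10.0], [1.2, 11.0], [0.9, 9.5],
                  [5.0, 30.0], [5.3, 31.0], [4.8, 29.0]])
X_cat = np.array([[0, 0], [0, 0], [0, 1],
                  [1, 2], [1, 2], [1, 1]])
labels = np.array([0, 0, 0, 1, 1, 1])  # cluster labels from any algorithm

n_num, n_cat = X_num.shape[1], X_cat.shape[1]

# Numerical part: min-max scale each column to [0, 1], then take the
# Manhattan (cityblock) distance averaged over the numerical columns
col_range = X_num.max(axis=0) - X_num.min(axis=0)
X_scaled = (X_num - X_num.min(axis=0)) / col_range
d_num = cdist(X_scaled, X_scaled, metric="cityblock") / n_num

# Categorical part: Hamming distance, i.e. fraction of mismatching columns
d_cat = cdist(X_cat, X_cat, metric="hamming")

# Combine, weighting each part by its number of features (an assumed choice)
D = (n_num * d_num + n_cat * d_cat) / (n_num + n_cat)

# Evaluate the clustering on the combined distance matrix
score = silhouette_score(D, labels, metric="precomputed")
```

Scaling the numerical columns first keeps both parts in the range [0, 1], so neither data type dominates the combined distance purely because of its units. A constant column would make `col_range` zero and would need to be dropped or handled separately.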