14

I have a dataset that contains both categorical (nominal and ordinal) and numerical attributes. I want to calculate the (dis)similarity matrix across my observations using these mixed attributes. Using the daisy() function of the cluster package in R, I can easily get a dissimilarity matrix as follows:

if(!require("cluster")) { install.packages("cluster");  require("cluster") }
data(flower)
as.matrix(daisy(flower, metric = "gower"))

This uses the gower metric to deal with the nominal variables. Is there a Python equivalent of the daisy() function in R?

Or maybe any other module function that allows using the Gower metric or something similar to calculate the (dis)similarity matrix for a dataset with mixed (nominal, numeric) attributes?

www
Zhubarb

2 Answers

17

Just implementing a Gower function to use with pdist won't be enough.

Internally, pdist performs several numerical transformations that will fail if you pass it a matrix with mixed data.

I implemented the Gower function according to the original paper, along with the adaptations it required in the pdist module (I could not simply override the functions, because the defs in the pdist module are private).

The results I have obtained with this so far match those of R's daisy function.

The source code is available as a Jupyter notebook: https://sourceforge.net/projects/gower-distance-4python/files/
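For readers who only want the basic idea, here is a minimal NumPy sketch of a Gower-style dissimilarity matrix. It is not the code from the notebook above and has not been checked against daisy. The function name `gower_matrix` and the toy data are made up for illustration, and the sketch assumes complete data (no missing values), equal feature weights, and nominal categories only.

```python
import numpy as np

def gower_matrix(num, cat):
    """Pairwise Gower-style dissimilarity (illustrative sketch).

    num: (n, p) array of numeric columns
    cat: (n, q) array of nominal labels
    Assumes no missing values and equal weights for all features.
    """
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    n = num.shape[0]
    total = np.zeros((n, n))

    # Numeric features: absolute difference scaled by the column range.
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # a constant column contributes 0 everywhere
    for k in range(num.shape[1]):
        col = num[:, k]
        total += np.abs(col[:, None] - col[None, :]) / ranges[k]

    # Nominal features: 0 if the labels match, 1 otherwise.
    for k in range(cat.shape[1]):
        col = cat[:, k]
        total += (col[:, None] != col[None, :]).astype(float)

    # Average over all features.
    return total / (num.shape[1] + cat.shape[1])

# Toy data (hypothetical): one numeric and one nominal column.
num = np.array([[1.0], [3.0], [5.0]])
cat = np.array([["a"], ["a"], ["b"]])
D = gower_matrix(num, cat)
```

Because each feature is handled with array broadcasting rather than a Python loop over pairs, the cost grows with the number of features times n squared array operations, which is usually much faster than calling a Python function once per pair.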

  • This looks great :) are there plans for this to be included in sklearn or published otherwise? – JP1 May 01 '18 at 11:48
  • 3
    Yes, there is a ticket on the way for sklearn (https://github.com/scikit-learn/scikit-learn/issues/5884), I'm fixing some points after review of my pull request, hopefully we'll get this implementation pushed to master of this project. – Marcelo Beckmann May 05 '18 at 13:44
  • Can I ask - is there a difference between gower distance and similarity? My assumption is that similarity = 1-distance? – JP1 May 08 '18 at 06:54
  • Hi, the Gower distance is a similarity measure, and in fact there is no mention about dissimilarity in the original paper (http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf). – Marcelo Beckmann May 13 '18 at 19:19
  • Hi, @MarceloBeckmann thank you for your implementation. However, it does not scale well. For 5000 data points Mahalanobis takes 1 second, but your Gower takes 3 minutes. Can you vectorize the code? Thx :) – lambruscoAcido Dec 04 '20 at 11:28
10

I believe you are looking for scipy.spatial.distance.pdist.

If you implement a function that computes the Gower distance on a single pair of observations, you can pass that function to pdist and it will apply it pairwise and return the resulting matrix of pairwise distances. It does not appear that the Gower distance is one of the built-in options.
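As a rough illustration of passing your own function to pdist, the sketch below uses a simple mismatch fraction rather than a full Gower distance. The function name `mismatch_fraction` and the data are hypothetical, and the nominal attributes are integer-coded so that the input is a purely numeric array.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mismatch_fraction(u, v):
    """Fraction of positions where two integer-coded rows differ."""
    return float(np.mean(u != v))

# Hypothetical data: each column is a nominal attribute, integer-coded.
X = np.array([[0, 1, 2],
              [0, 1, 0],
              [1, 0, 2]])

# pdist calls the function once per pair of rows and returns the
# condensed form (length n*(n-1)/2); squareform expands it to n x n.
condensed = pdist(X, metric=mismatch_fraction)
D = squareform(condensed)
```

The callable is invoked from Python for every pair, so this approach is convenient but can be slow for large n.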

Likewise, if a single observation has mixed attributes, you can define your own function that, say, uses the Euclidean distance on the subset of numerical attributes, a Gower distance on the subset of categorical attributes, and adds them. Any other definition of what it means, for your application, to compute the distance between two isolated observations would work just as well.
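One way such a combined function might look is sketched below. The name `mixed_distance`, the index arguments, the weights, and the sample rows are all assumptions made for illustration. The categorical part is a plain mismatch count, and the numeric columns are left unscaled, so in practice you would probably want to standardize them or adjust the weights.

```python
import numpy as np

def mixed_distance(a, b, num_idx, cat_idx, w_num=1.0, w_cat=1.0):
    """Weighted sum of a Euclidean part and a categorical-mismatch part."""
    num_a = np.array([a[i] for i in num_idx], dtype=float)
    num_b = np.array([b[i] for i in num_idx], dtype=float)
    d_num = np.linalg.norm(num_a - num_b)        # Euclidean on numeric fields
    d_cat = sum(a[i] != b[i] for i in cat_idx)   # count of differing labels
    return w_num * d_num + w_cat * d_cat

# Hypothetical observations: two numeric fields, then two nominal fields.
rows = [
    (1.0, 2.0, "red", "low"),
    (4.0, 6.0, "red", "high"),
    (1.0, 2.0, "blue", "low"),
]

# Fill the symmetric matrix by looping over each unordered pair once.
n = len(rows)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = mixed_distance(
            rows[i], rows[j], num_idx=(0, 1), cat_idx=(2, 3)
        )
```

The weights `w_num` and `w_cat` are where you would control the relative importance of the numeric and categorical parts.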

For clustering in Python, you usually want to work with scikit-learn, and this question and answer page discusses exactly this problem of using a custom distance measure (in your case Gower) with scikit-learn, which does not appear to be possible.

You could use one of the choices provided by pdist along with the implementation at that linked answer page, or you could implement a function for the Gower similarity and use that. But if you want the out-of-the-box clustering tools from scikit-learn, it does not appear to be directly possible.

Community
ely
  • 1
    Thank you. Do you know of any out-of-the-box distance metrics available in scikit-learn that can jointly deal with categorical and numeric variables? – Zhubarb Oct 15 '14 at 17:09
  • 2
    I do not. Their documentation is good, so searching should reveal results quickly if it exists. However, my approach would be to define my own small distance function that handled this how I wanted, and to pass that off to `pdist`. That way I could control the relative importance of different aspects of that calculation. If this became slow, I would either use numba or Cython to target implementing just that function at a lower level to speed it up. – ely Oct 15 '14 at 17:11