Clustering algorithms: HDBSCAN in R vs HDBSCAN in Python?

Question

For exploratory data analysis, which clustering method would be best? I currently use HDBSCAN. The problem is that the results I get from HDBSCAN in R differ from the results I get from HDBSCAN in Python.

R version: https://rdrr.io/cran/largeVis/man/hdbscan.html

Link to data file for R: https://www.dropbox.com/s/731hjrj0geibi3f/test.csv?dl=0

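# "data" is a placeholder for the data loaded from the linked file
test_r <- data.frame("data")
# The package and its main function are spelled largeVis (R is case-sensitive)
vis <- largeVis::largeVis(test_r)
cluster <- largeVis::hdbscan(vis)
largeVis::gplot(cluster, t(vis$coords), text = TRUE)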

OUTPUT of R

Python version: https://github.com/scikit-learn-contrib/hdbscan/tree/master/hdbscan

Link to data file for Python : https://www.dropbox.com/s/640elbjr1xt8q3e/test_projection.txt?dl=0

# IPython/Jupyter magic; omit this line when running as a plain script
%pylab
import hdbscan
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

projection = np.loadtxt("data")
projection = projection[1:1001,:]

clusterer = hdbscan.HDBSCAN(min_cluster_size=20, gen_min_span_tree=True)
clusterer.fit(projection)

palette = sns.color_palette()
cluster_colors = [sns.desaturate(palette[col], sat)
              if col >= 0 else (0.5, 0.5, 0.5) for col, sat in
              zip(clusterer.labels_, clusterer.probabilities_)]

# Plot the same array that was clustered above
plt.scatter(projection.T[0], projection.T[1], c=cluster_colors)

OUTPUT of Python

What is the reason for the difference between the outputs of the two versions, and how can I judge which result is more accurate (i.e. number of clusters, cluster sizes and noise)?
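To make the comparison concrete, here is a minimal sketch of how the two labelings could be put side by side. It assumes both runs were made on the same points in the same order, and that the R labels have been exported with noise recoded as -1 (the convention the Python hdbscan package uses). The helper names `summarize` and `adjusted_rand_index` are hypothetical, and the adjusted Rand index is just one standard agreement measure swapped in here for illustration, not something either library prescribes.

```python
from collections import Counter
from math import comb

import numpy as np

NOISE = -1  # assumption: noise is coded as -1 in both labelings


def summarize(labels):
    """Return the number of clusters, the size of each cluster and the noise count."""
    labels = np.asarray(labels)
    sizes = Counter(int(l) for l in labels if l != NOISE)
    return {
        "n_clusters": len(sizes),
        "cluster_sizes": dict(sorted(sizes.items())),
        "n_noise": int(np.sum(labels == NOISE)),
    }


def adjusted_rand_index(a, b):
    """Agreement between two labelings of the same points.

    Label names do not matter, only which points are grouped together.
    Noise is treated as one more group here.
    """
    a = [int(x) for x in a]
    b = [int(x) for x in b]
    if len(a) != len(b):
        raise ValueError("both labelings must cover the same points")
    total = comb(len(a), 2)
    if total == 0:
        return 1.0
    # Count point pairs that fall together in both, in a, and in b
    sum_cells = sum(comb(n, 2) for n in Counter(zip(a, b)).values())
    sum_a = sum(comb(n, 2) for n in Counter(a).values())
    sum_b = sum(comb(n, 2) for n in Counter(b).values())
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy labels standing in for the real outputs of the two runs
labels_python = [0, 0, 0, 1, 1, -1]
labels_r = [1, 1, 1, 0, 0, -1]

print(summarize(labels_python))
print(summarize(labels_r))
print(adjusted_rand_index(labels_python, labels_r))
```

A score near 1 means the two runs group the points the same way even if the cluster numbers differ; a score near 0 means the agreement is no better than chance. Neither number says which run is "right", only how far apart they are.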

http://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html

Show an example! And try to find the original version, not reimplementations. — Has QUIT--Anony-Mousse, May 05 '18 at 08:00
@Anony-Mousse I think you are correct. The Python API has many parameters and hyperparameters. I was able to replicate the results on another data set after much trial and error with different combinations. Personally, I feel the Python API gives more control and flexibility than the R one. — Div Trivedi, May 10 '18 at 15:24
