For exploratory data analysis, which clustering method would be best? I currently use HDBSCAN. The problem is that the results I get from HDBSCAN in R differ from the results I get from HDBSCAN in Python.

R version: https://rdrr.io/cran/largeVis/man/hdbscan.html

Link to data file for R: https://www.dropbox.com/s/731hjrj0geibi3f/test.csv?dl=0

test_r <- data.frame("data")
# R is case-sensitive: the package and its main function are both spelled largeVis
vis <- largeVis::largeVis(test_r)
cluster <- largeVis::hdbscan(vis)
largeVis::gplot(cluster, t(vis$coords), text = TRUE)

OUTPUT of R

Python version: https://github.com/scikit-learn-contrib/hdbscan/tree/master/hdbscan

Link to data file for Python : https://www.dropbox.com/s/640elbjr1xt8q3e/test_projection.txt?dl=0

%pylab
import hdbscan
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

projection = np.loadtxt("data")
projection = projection[1:1001,:]

clusterer = hdbscan.HDBSCAN(min_cluster_size=20, gen_min_span_tree=True)
clusterer.fit(projection)

palette = sns.color_palette()
cluster_colors = [sns.desaturate(palette[col], sat)
              if col >= 0 else (0.5, 0.5, 0.5) for col, sat in
              zip(clusterer.labels_, clusterer.probabilities_)]

fig = plt.scatter(projection.T[0], projection.T[1], c=cluster_colors)

OUTPUT of Python
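Before comparing the two plots by eye, it may help to reduce each run to a few numbers. The sketch below is a hypothetical helper (`summarize_labels` is not part of either library) that counts clusters, cluster sizes, and noise points from a label vector. It assumes noise is marked with `-1`, as in the Python `hdbscan` package's `labels_`; labels exported from R would need to be recoded to that convention first.

```python
from collections import Counter


def summarize_labels(labels, noise_label=-1):
    """Summarize a clustering: cluster count, sizes, and noise.

    `noise_label` is an assumption: -1 matches the Python hdbscan
    convention; recode labels from other tools before calling this.
    """
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(noise_label, 0)  # remove noise from the cluster tally
    n_points = len(labels)
    return {
        "n_clusters": len(counts),
        "cluster_sizes": dict(sorted(counts.items())),
        "n_noise": n_noise,
        "noise_fraction": n_noise / n_points if n_points else 0.0,
    }


# Toy example with two clusters and two noise points
summary = summarize_labels([0, 0, 1, -1, -1, 1, 1])
print(summary)
```

With the code above, `summarize_labels(clusterer.labels_)` would give the same summary for the Python run.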

What causes the difference between the outputs of the two versions, and how can I determine which result is more accurate (i.e., in terms of number of clusters, cluster sizes, and noise)?
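One way to put a number on how much the two outputs disagree is the adjusted Rand index (ARI), which compares two labelings of the same points and ignores how the clusters are numbered. The following is a minimal pure-Python sketch, not taken from either library. It assumes both label vectors cover the same points in the same row order and that noise is coded as `-1` in both, so that noise is treated as one extra group.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same points.

    1.0 means identical partitions (up to renaming of labels);
    values near 0 mean agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: trivially identical

    # Contingency counts: how many points share each (a, b) label pair
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / total_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (sum_joint - expected) / (max_index - expected)


# Same partition with different label numbers, -1 standing in for noise
r_labels = [0, 0, 1, 1, -1]
py_labels = [5, 5, 2, 2, -1]
print(adjusted_rand_index(r_labels, py_labels))
```

A score close to 1 would suggest that the two implementations mostly agree and differ only in labeling or plotting. A low score points to real differences, for example in the input data or parameters. Note that ARI measures agreement between the two runs, not accuracy against a ground truth.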

http://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html

Has QUIT--Anony-Mousse
Div Trivedi
