7

*(screenshot of the code, also shown as text below)*

    from sklearn.cluster import DBSCAN

    dbscan = DBSCAN(eps=0.001, min_samples=10)
    clustering = dbscan.fit(X)

Example vectors:

    array([[ 0.05811029, -1.089355  , -1.9143777 , ...,  1.235167  ,
            -0.6473859 ,  1.5684978 ],
           [-0.7117326 , -0.31876346, -0.45949244, ...,  0.17786546,
             1.9377285 ,  2.190525  ],
           [ 1.1685177 , -0.18201494,  0.19475089, ...,  0.7026453 ,
             0.3937522 , -0.78675956],
           ...,
           [ 1.4172379 ,  0.01070347, -1.3984257 , ..., -0.70529956,
             0.19471683, -0.6201791 ],
           [ 0.6171041 , -0.8058429 ,  0.44837445, ...,  1.216958  ,
            -0.10003573, -0.19012968],
           [ 0.6433722 ,  1.1571665 , -1.2123466 , ...,  0.592805  ,
             0.23889546,  1.6207514 ]], dtype=float32)

X is `model.wv.vectors`, generated from `model = word2vec.Word2Vec(sent, min_count=1, size=50, workers=3, window=3, sg=1)`.

Results (the `clustering.labels_` array) are as follows:

array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])

Jing
  • It's going to be hard for people to answer your question if they cannot replicate the code. Can you take the code from the images and format it here? Also, if you have some sample data you could provide, that would help us in problem-solving with you. – lwileczek Jan 16 '20 at 18:28
  • @lwileczek I just don't know how to write the code here. – Jing Jan 16 '20 at 18:44
  • Can you show the code that actually outputs the array of `-1` values? Also, per the `DBSCAN` docs, it's designed to return `-1` for 'noisy' samples that aren't in any 'high-density' cluster. It's possible that your word-vectors are so evenly distributed there are no 'high-density' clusters. (From what data are you training the word-vectors, & how large is the set of word-vectors? Have you verified the word-vectors appear sensible/useful by other checks?) – gojomo Jan 16 '20 at 19:09
  • You might need to tune the DBSCAN parameters for your data. And, it might make sense to operate on the unit-length-normed word-vectors, instead of the raw magnitude vectors. (Execute `model.wv.init_sims()`, then use `model.wv.vectors_norm` instead of `model.wv.vectors`.) Finally, `min_count=1` usually results in worse word-vectors than a higher `min_count` value that discards words with so few usage examples. Rare words can't get strong vectors, & keeping them in training also interferes with improvement of other more-frequent words' vectors. – gojomo Jan 16 '20 at 19:12
  • @gojomo The array shown is `clustering.labels_`. And I will try your suggestion later ~ thanks. – Jing Jan 16 '20 at 19:44
  • @gojomo Sorry, I still don't know how to format code in the comment box.... – Jing Jan 16 '20 at 19:45
  • The indentation you've used in the topmost 3 lines of code you've shown is one perfectly fine way of formatting a code excerpt. There's a lot more info on ways to present your typed, or copied-and-pasted, text of code or output at: https://stackoverflow.com/editing-help – gojomo Jan 16 '20 at 20:31
  • @gojomo I tried your way with `model.wv.vectors_norm` and `model.wv.vectors`. I cannot set `min_count` higher, since in my dataset there are DishNames that show up only once. – Jing Jan 16 '20 at 21:00
  • @gojomo And with more than 30k words the result is bad too, only -1. – Jing Jan 16 '20 at 21:04
  • @gojomo Still bad even if I set `min_count=5`.... almost crying.... – Jing Jan 16 '20 at 21:31
  • Words that only have 1 example in your training data are unlikely to get good word-vectors. Their final positions will be some mix of their random starting position, & the influence of the possibly-arbitrarily idiosyncratic single usage example – offset by the influence of all the other more-frequent words on the neural-network's weights. So any patterns of their neighborhoods for clustering may be weak – they are nearly 'noise', so it wouldn't be surprising if they contribute to leaving `DBSCAN` with lots of 'noisy' results. – gojomo Jan 17 '20 at 18:11
  • 30k total words would be a tiny, tiny dataset for `Word2Vec` purposes. Is that the size of the corpus, or the number of unique words? With a small corpus, or small number of unique words, but still multiple varied examples of each word, you might be able to get useful `Word2Vec` results with smaller `size` dimensions & more `epochs` training-passes, but it's not certain. Have you been able to check the vectors for usefulness separate from the clustering, by spot-reviewing if vectors' `most_similar()` neighbors make sense according to your domain understanding? – gojomo Jan 17 '20 at 18:15
  • Your best chance of getting some contrastingly-meaningful vectors could be to do all of: (1) higher `min_count` (while observing to see exactly how far this further shrinks the effective corpus); (2) more `epochs`; (3) fewer `size` dimensions. (Possibly also: larger `window` or `negative`.) Then also, using `vectors_norm` (to move all vectors to points on the 'unit sphere' for more contrast given the `DBSCAN` euclidean-neighborhoods). Then also, tinkering with the `DBSCAN` parameters to make it more sensitive (see the sketch after these comments). – gojomo Jan 17 '20 at 18:19
  • But still, you might not have enough data for `Word2Vec` to work well, and `DBSCAN` clustering might not be good for even stronger `Word2Vec` vectors, unless you have some external reason to believe these are the right algorithms for your data/problem-domain. Why do you want to create a fixed number of clusters from these word-vectors? – gojomo Jan 17 '20 at 18:20
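A minimal sketch pulling the suggestions from these comments together. Assumptions: `sent` is the tokenized corpus from the question, the gensim 3.x parameter names `size`/`iter` are used (gensim 4.x renames them to `vector_size`/`epochs`), and values such as `size=25`, `iter=20`, `eps=0.4` and the placeholder `'some_frequent_word'` are only illustrative starting points, not recommendations.

    from gensim.models import word2vec
    from sklearn.cluster import DBSCAN

    # Retrain with a higher min_count, fewer dimensions and more passes
    # (all values here are illustrative and need tuning for your corpus).
    model = word2vec.Word2Vec(sent, min_count=5, size=25, window=5, sg=1,
                              workers=3, iter=20)

    # Sanity-check the vectors before clustering: replace the placeholder
    # with a frequent word from your own data and see whether its nearest
    # neighbours make sense in your domain.
    print(model.wv.most_similar('some_frequent_word', topn=10))

    # Unit-length-normalize the vectors so the Euclidean neighbourhoods
    # DBSCAN uses are more comparable, then cluster with a larger eps.
    model.wv.init_sims()
    X_norm = model.wv.vectors_norm

    labels = DBSCAN(eps=0.4, min_samples=5).fit(X_norm).labels_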

2 Answers

6

Based on the docs:

    labels_ : array, shape = [n_samples]

Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.

You can find the answer to this here: What are noisy samples in Scikit's DBSCAN clustering algorithm?

In short: these points are not part of any cluster. They are simply points that do not belong to any cluster and can be "ignored" to some extent. It seems that your data is quite spread out and does not form any high-density clusters.
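For example, you can check how many points ended up as noise versus in actual clusters (a small sketch, assuming `clustering` is the fitted `DBSCAN` object from the question):

    import numpy as np

    labels = clustering.labels_

    # Count how many points fall under each label; -1 means "noise".
    unique, counts = np.unique(labels, return_counts=True)
    print(dict(zip(unique, counts)))

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    print(f"clusters: {n_clusters}, noise points: {n_noise}")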

What can you try?

    DBSCAN(eps=0.5, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)

You can play with these parameters, or change the clustering algorithm. Did you try k-means?
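If you want to try k-means, a minimal sketch (the cluster count of 20 is just an assumed value you would need to choose or tune for your data; `X` is `model.wv.vectors` as in the question):

    from sklearn.cluster import KMeans

    # k-means assigns every point to one of n_clusters groups, so there
    # are no -1 "noise" labels; n_clusters=20 is purely illustrative.
    kmeans = KMeans(n_clusters=20, random_state=0)
    labels = kmeans.fit_predict(X)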

PV8
  • I tried yours and it's better, but not good enough; results are -1 and 0. I had tried KMeans and it worked well. I'm so curious about why such a difference exists. – Jing Jan 17 '20 at 11:28
1

Your eps value is 0.001; try increasing it so that clusters can form (otherwise every point will be considered an outlier / labelled -1 because it's not in any cluster).
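One common way to pick a more reasonable `eps` is a k-distance plot: sort each point's distance to its k-th nearest neighbour and look for the "elbow" in the curve. A sketch, assuming `X` is the array of word-vectors from the question and `k` matches `min_samples`:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import NearestNeighbors

    k = 10  # match the min_samples used in DBSCAN
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nn.kneighbors(X)

    # Sorted distance to the k-th nearest neighbour; the "elbow" of this
    # curve is a reasonable starting value for eps.
    k_distances = np.sort(distances[:, -1])
    plt.plot(k_distances)
    plt.ylabel(f"distance to {k}-th nearest neighbour")
    plt.show()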

ron_g