Clustering in High Dimensions + some basic stuff

Question

I've been studying Support Vector Machines (SVMs) for a while, and recently started reading articles on clustering. With SVMs, we did not need to worry about the dimensionality of the data; however, I learned that in clustering, due to the "curse of dimensionality", the number of dimensions is a big issue. Furthermore, the sparsity and the size of the data also greatly affect which clustering algorithm you choose. So I more or less understand that there is no "best algorithm" for clustering, and that it all depends on the nature of the data.

Having said that, I want to ask some really basic questions on Clustering.

1. When people say "high dimension", what do they mean specifically? Is 100-d high-dimensional? Or does this depend on the type of data you have?
2. I've seen answers on this website saying something like "using k-means on data with hundreds of dimensions is very usual". If this is true, does it also hold for other clustering algorithms that use the same distance metric as k-means?
3. On p. 649 of the paper "Survey of Clustering Algorithms" (http://goo.gl/WQyuxo) by Rui Xu et al., the table shows that CURE has "the capability of tackling high dimensional data". Does anybody have an idea of how high a dimensionality they are talking about?
4. If I wanted to cluster high-dimensional data of adequate size, randomly sampled from the initial big data set, what kind of algorithms would be appropriate? I understand that density-based algorithms such as DBSCAN do not perform well under random sampling.
5. Can anybody tell me how well or badly CURE performs on high-dimensional data? Intuitively, I guess CURE does not perform well given the "curse of dimensionality", but it would be great to see some detailed results.
6. Are there any websites/papers/textbooks explaining the pros and cons of clustering algorithms? I've seen some papers on the pros and cons of basic algorithms, e.g., k-means, hierarchical clustering, DBSCAN, etc., but I want to know more about other algorithms such as CURE, CLIQUE, CHAMELEON, etc.

Sorry for asking so many questions all at once! It would be awesome if anybody could answer any one of them. Also, if I have stated a question badly or asked a completely pointless one, don't hesitate to tell me. And if anybody knows a great textbook or survey paper on clustering that elaborates on these subjects, please tell me! Thank you in advance.

To me (NLP guy), 100-d is low-dimensional. Think 10k features or more. In fact when I do k-means, I usually do dimensionality reduction to at most a few 1000-d to get better/more stable clustering. But it depends on the space spanned by the features. — Fred Foo, Apr 30 '14 at 15:57
Text data is special, because it is sparse and usually discrete or even binary. 10k features may yield only some 100 bits of information. Continuous vector data has up to 64 bits per dimension (say, 20 in reality), so 5-20 continuous dimensions may be as challenging or troublesome as 10k of term-frequency dimensions. I don't know if there is any MDL type of study that examines this relationship of information density to the curse of dimensionality. — Has QUIT--Anony-Mousse, Apr 30 '14 at 17:56
#1 and #2 are probably opinion-based (my definition of "High Dimension" can differ from someone else's, and "very usual" is certainly relative), #3 is probably off topic, #4 is probably too broad, #5 probably needs a bit more research effort on your part and more specific details, #6 is off topic (see the help center). And you really should stick to a single question per question; asking multiple questions doesn't fit well into the Stack Overflow model, as some answers may only answer parts or answer some parts incorrectly. — Bernhard Barker, Apr 30 '14 at 18:05
You should post questions separately, not as one big question. — Kevin Reid, Apr 30 '14 at 18:40
Thanks for the comments, and sorry for the format. I'll try better next time. — ruparunpa, May 01 '14 at 01:35
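
To make the "curse of dimensionality" from the question a bit more concrete, here is a minimal sketch of one commonly cited symptom: distances from a query point to its nearest and farthest neighbours become relatively similar as the number of dimensions grows. It is an illustration under stated assumptions, not a benchmark. The data is assumed to be i.i.d. uniform in the unit cube, the function name `relative_contrast` is made up for this example, and real data (as the comments above point out) may span a much lower-dimensional space than its raw feature count suggests.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points random points.

    Assumption: all coordinates are i.i.d. uniform on [0, 1]; this is a
    worst case for distance concentration, not a model of real data.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # A large value means near and far neighbours are easy to tell apart;
    # a value close to 0 means all points look roughly equally far away.
    for dim in (2, 10, 100, 1000):
        print(f"{dim:>5}-d  relative contrast = {relative_contrast(dim):.3f}")
```

When the contrast is small, any algorithm that relies on Euclidean distance (k-means and others using the same metric) has less signal to separate clusters, which is one reason dimensionality reduction is often applied first. How quickly this happens depends on the data, so whether 100-d counts as "high" cannot be read off this toy setup.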

Has QUIT--Anony-Mousse · Accepted Answer · 2014-04-30T17:58:05.483
