In clustering, what effect do noisy, redundant, and irrelevant attributes have? Do they end up helping or hurting the clustering? I know that clustering handles noisy data poorly, but I am not sure about the other two.
1 Answer
Noise
The performance of many clustering algorithms, such as k-means and partitioning around medoids, degrades as the percentage of noise increases. For example, in k-means clustering, outliers (data points that differ greatly from the rest of the data set) pull the cluster centroids away from where they would otherwise be. The algorithm then takes a long time to converge and may not produce a good clustering.
With most clustering algorithms, it is preferable to remove the noise (outliers) from the data set before clustering.
For more details: Effect of noise on the performance of clustering techniques
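As a small illustration of the centroid shift described above, here is a minimal sketch using made-up 2-D points (the data and the `centroid` helper are assumptions for illustration, not taken from the linked paper). It shows how a single outlier drags the mean of a cluster away from the bulk of its points:

```python
def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

# Hypothetical tight cluster around (1.5, 1.5).
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]

# The same cluster with one far-away point added.
with_outlier = cluster + [(20.0, 20.0)]

print(centroid(cluster))       # (1.5, 1.5)
print(centroid(with_outlier))  # (5.2, 5.2)
```

With the outlier included, the centroid no longer lies near any of the original four points, which is the kind of distortion a mean-based algorithm such as k-means would suffer from.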
Redundant data (not redundant attributes, but redundant data points)
This also affects clustering negatively, but how much depends on the clustering algorithm. If the algorithm takes the frequency of data points into account (for example, by taking the mean or median of the clustered points), then the mean or median of a cluster may shift.
Normally you don't want to cluster data on the basis of how often a given data point occurs. So if a data point is redundant, it is a good idea to remove it before clustering.
If you instead mean redundant attributes (i.e. correlated attributes), they may or may not affect clustering. It depends on the domain of the data set.
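To make the effect of duplicated points concrete, here is a minimal sketch with invented 1-D values (the numbers are assumptions chosen only for illustration). Repeating one point pulls the cluster mean toward it, and removing exact duplicates restores the original mean:

```python
from statistics import mean

# Hypothetical cluster in which the value 10.0 was recorded four times.
points = [0.0, 10.0, 10.0, 10.0, 10.0]

# dict.fromkeys drops exact duplicates while keeping the original order.
deduplicated = list(dict.fromkeys(points))

print(mean(points))        # 8.0
print(mean(deduplicated))  # 5.0
```

Whether deduplication is appropriate depends on whether the repeats are recording artifacts or genuine repeated observations.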
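One way a redundant attribute can matter, sketched here with made-up points and Euclidean distance (both are assumptions, since the answer does not name a distance measure): an exact copy of an attribute effectively counts that attribute twice, which can change which point is nearest.

```python
import math

def with_copy_of_first(point):
    """Return the point with its first attribute duplicated (a fully redundant attribute)."""
    return (point[0], point[0]) + tuple(point[1:])

query = (0.0, 0.0)
p1 = (3.0, 0.0)  # differs from the query only in the first attribute
p2 = (0.0, 4.0)  # differs from the query only in the second attribute

# Original attributes: p1 is the nearer point (3.0 versus 4.0).
print(math.dist(query, p1), math.dist(query, p2))

# With the first attribute duplicated, p2 becomes the nearer point.
q2, p1_2, p2_2 = (with_copy_of_first(p) for p in (query, p1, p2))
print(math.dist(q2, p1_2), math.dist(q2, p2_2))
```

If the correlated attributes all carry the structure you care about, the extra weight may be harmless, which is why the effect depends on the data.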
Irrelevant attribute
These too affect clustering negatively. Irrelevant attributes contribute to the distance between points without carrying any cluster structure, so the clustering may fail to converge to a meaningful result. In fact, irrelevant attributes are sometimes treated as noise. Higher dimensionality also brings the curse of dimensionality, so it is often suggested to perform dimensionality reduction before clustering.
Some details:
Clustering high dimensional data
Effect of irrelevant attribute on fuzzy clustering
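The effect of irrelevant attributes on distances can be sketched with synthetic data. Everything below is an assumption made for illustration: two groups separated on a single relevant attribute, padded with uniformly random irrelevant attributes, and a hypothetical `separation` score defined as the mean between-group distance divided by the mean within-group distance. As irrelevant attributes are added, the score drops toward 1, meaning the groups become hard to tell apart by distance alone.

```python
import math
import random
from itertools import combinations, product

def make_group(center, n, n_irrelevant, rng):
    """One relevant attribute near `center`, plus uniformly random irrelevant attributes."""
    return [
        [rng.gauss(center, 0.1)] + [rng.random() for _ in range(n_irrelevant)]
        for _ in range(n)
    ]

def separation(n_irrelevant, n=30, seed=0):
    """Mean between-group distance divided by mean within-group distance."""
    rng = random.Random(seed)
    a = make_group(0.0, n, n_irrelevant, rng)
    b = make_group(1.0, n, n_irrelevant, rng)
    within = [math.dist(p, q) for g in (a, b) for p, q in combinations(g, 2)]
    between = [math.dist(p, q) for p, q in product(a, b)]
    return (sum(between) / len(between)) / (sum(within) / len(within))

for k in (0, 5, 50):
    print(k, round(separation(k), 2))
```

The printed values depend on the random seed, but the score should fall as `k` grows, which is one reason dimensionality reduction or attribute selection before clustering is commonly suggested.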
