
There seem to be many techniques for reducing dimensionality (PCA, SVD, etc.) in order to escape the curse of dimensionality. But how do you know that your dataset in fact suffers from high-dimensionality problems? Is there a best practice, such as a visualization, or can one even use KNN to find out?

I have a dataset with 99 features, 1 continuous label (price), and 30,000 instances.

endorphinus

1 Answer


The curse of dimensionality describes the relationship between the number of features and the amount of data you need: as the dimensionality of your feature space grows, the amount of data required to model your problem successfully grows exponentially.

The problem arises precisely because of that exponential growth in the data you would need, since you then have to think about how to handle it (storage and computational power). So we usually experiment to figure out the right number of dimensions for the problem (for example, using cross-validation) and only select those features. Also keep in mind that using lots of features carries a high risk of overfitting.
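A minimal sketch of that kind of experiment, assuming a NumPy feature matrix `X` of shape 30,000 × 99 and a continuous target `y` such as price (the random arrays below are only placeholders for your data, and `Ridge` is just an example estimator):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(30000, 99))                    # placeholder features
y = X[:, :5].sum(axis=1) + rng.normal(size=30000)   # placeholder price label

# Score a model built on only the k best features for several values of k,
# and compare the cross-validated performance.
for k in (5, 10, 25, 50, 99):
    model = make_pipeline(SelectKBest(f_regression, k=k), Ridge())
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"k={k:>2}  mean R^2 = {scores.mean():.3f}")
```

If the score plateaus (or drops) well before k reaches 99, that is a practical sign that the extra dimensions are not helping the model.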

You can use either feature selection or feature extraction for dimensionality reduction: LASSO can be used for feature selection, while PCA or LDA can be used for feature extraction.
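A rough illustration of both routes, again with placeholder data standing in for the 99-feature price dataset: LASSO drives the coefficients of uninformative features to exactly zero (selection), while PCA projects the features onto a smaller set of new components (extraction).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(30000, 99))                    # placeholder features
y = X[:, :5].sum(axis=1) + rng.normal(size=30000)   # placeholder price label

# Feature selection: keep the features whose LASSO coefficient is non-zero.
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)
kept = (lasso.named_steps["lasso"].coef_ != 0).sum()
print(f"LASSO kept {kept} of {X.shape[1]} features")

# Feature extraction: keep enough principal components to explain
# 95% of the variance.
pca = make_pipeline(StandardScaler(), PCA(n_components=0.95)).fit(X)
print(f"PCA reduced {X.shape[1]} features to "
      f"{pca.named_steps['pca'].n_components_} components")
```

The `alpha` for LASSO and the variance threshold for PCA are assumptions here; in practice you would tune them with cross-validation as described above.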

Sijan Bhandari
  • Yes thanks, I know this. But what experiments and visualizations do you run? Or do I just know that there is a dimensionality problem from comparing the accuracy and runtime when evaluating subsets of the final models? – endorphinus Jun 04 '20 at 09:32
  • I have already mentioned the cross-validation approach to select relevant features for your problem. That means you run your experiments with subsets of features and see how your model performs under cross-validation. You might consider plotting accuracy for each selection of features and seeing which one works well for you. If you want to look at an example: [sklearn feature elimination](https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py) (a sketch along these lines appears after these comments). – Sijan Bhandari Jun 04 '20 at 09:49
  • OK. Let's say you do RFE with CV and end up with a smaller subset that makes your model perform better; how do you know this is due to typical curse-of-dimensionality problems, like the data becoming sparse and distances more extreme? Could it not just be that one removed some noise? – endorphinus Jun 04 '20 at 10:32
  • I think your concept of cross-validation isn't clear yet. Cross-validation itself helps get rid of noisy data because we use three different subsets (train/validation/test sets), and k-fold validation runs on different splits, creating different surrogate models that give a more accurate estimate of your model's performance, discarding the noise in your data. – Sijan Bhandari Jun 04 '20 at 17:47
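A minimal sketch of the RFE-with-cross-validation approach linked in the comments above, again on placeholder data in place of the 99-feature price dataset (Ridge is only an example estimator; any model exposing `coef_` or `feature_importances_` works with RFECV):

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30000, 99))                    # placeholder features
y = X[:, :5].sum(axis=1) + rng.normal(size=30000)   # placeholder price label

# Recursively drop the weakest features, picking the count that maximises
# the cross-validated R^2 score.
selector = RFECV(Ridge(), step=5, cv=5, scoring="r2").fit(X, y)
print(f"Optimal number of features: {selector.n_features_}")

# In recent scikit-learn versions, the per-step scores can be plotted
# against the number of features to see where performance levels off:
# selector.cv_results_["mean_test_score"]
```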