I have a dataframe with 300 float columns and 1 integer column, which is the dependent variable. The 300 columns are of 3 kinds:
1. Kind A: columns 1 to 100
2. Kind B: columns 101 to 200
3. Kind C: columns 201 to 300
I want to reduce the number of dimensions. Should I average the values within each kind and aggregate them into 3 columns (one per kind), or should I apply a dimensionality reduction technique like PCA? What is the justification either way?
2 Answers
Option 1:
Do not do dimensionality reduction at all if you have a large amount of training data (say, more than 5 × 300 = 1500 training samples).
Option 2:
Since you know that there are 3 kinds of data, run a PCA on each of the three kinds separately and keep, say, 2 features from each, i.e.
f1, f2 = PCA(kind A columns)
f3, f4 = PCA(kind B columns)
f5, f6 = PCA(kind C columns)
train(f1, f2, f3, f4, f5, f6)
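A minimal sketch of the per-kind PCA above, assuming scikit-learn; the data here is a random stand-in for the real dataframe, and the 2-components-per-kind choice just follows the answer:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 300-column dataframe: 500 samples.
X = rng.normal(size=(500, 300))
kind_a, kind_b, kind_c = X[:, :100], X[:, 100:200], X[:, 200:300]

# Fit a separate 2-component PCA on each kind, then concatenate
# the transformed scores into the 6 features f1..f6.
features = np.hstack([
    PCA(n_components=2).fit_transform(block)
    for block in (kind_a, kind_b, kind_c)
])
print(features.shape)  # (500, 6) -> feed these to the classifier
```

Fitting the three PCAs independently keeps each kind's structure from being swamped by the others' variance scales.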
Option 3:
Run PCA on all 300 columns together and keep only as many components as are needed to preserve 90%+ of the variance.
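With scikit-learn, a fractional `n_components` expresses the variance threshold directly; a small sketch on random stand-in data (the 0.90 threshold follows the answer):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Random stand-in for the real 300-column dataframe.
X = rng.normal(size=(500, 300))

# A float n_components in (0, 1) keeps the smallest number of
# components whose cumulative explained variance exceeds it.
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```

On real data with correlated columns, far fewer components are usually needed than on uncorrelated noise like this.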
Do not average; plain averaging discards information. But if you really want to average, and you know for certain that some features matter more than others, use a weighted average instead. In general, averaging features for dimensionality reduction is a very bad idea.
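For concreteness, the two averaging variants on random stand-in data; the weights here are arbitrary placeholders for whatever domain knowledge would supply:

```python
import numpy as np

rng = np.random.default_rng(2)
# Random stand-in for the real 300-column dataframe.
X = rng.normal(size=(500, 300))

# Plain per-kind average: 300 columns collapse to 3.
plain = np.stack(
    [X[:, i:i + 100].mean(axis=1) for i in (0, 100, 200)], axis=1
)

# Hypothetical per-column weights for kind A (normalized to sum to 1);
# uniform weights would reduce this to the plain average.
w = rng.random(100)
w /= w.sum()
weighted_a = X[:, :100] @ w  # weighted average of the kind-A columns
print(plain.shape, weighted_a.shape)
```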

PCA keeps only the components that capture the most variance, so not all of the original columns contribute to the result: the low-variance directions are discarded. So it may be better if you do averaging, as it uses all the columns to determine the output. Since you have a large number of features, it is better if all of them contribute to the output.
