
It is a common practice in data analysis to remove features (independent variables) with low variance for dimensionality reduction, with the justification that a feature with low variance cannot explain much of the variance in the response variable (dependent variable).

However, I don't fully understand this reasoning. Here is a counterexample (in R syntax):

 > independent_variable <- c(100000, 100000.01, 100000.02, 100000.03, 100000.04, 100000.05 )
 > dependent_variable  <- c(1,2,3,4,5,6)
 > cor(independent_variable , dependent_variable)
 [1] 1          # Pearson's correlation = 1
 > var(independent_variable )
 [1] 0.00035     
 > var(dependent_variable)
 [1] 3.5        # low variance of independent variable compared to dependent variable
 > var(independent_variable/mean(independent_variable))
 [1] 3.499998e-14   # very low variance
 > var(dependent_variable/mean(dependent_variable))
 [1] 0.2857143  # variance of scaled variables with mean=1
 

What I am trying to demonstrate is a case where the dependent and independent variables have a correlation of 1, i.e. the independent variable explains 100% of the variance of the dependent variable. Yet both in the original data and after scaling each variable to mean 1, the variance of the independent variable is much lower than that of the other variable (here, the dependent variable), so it would have been removed according to this reasoning.
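To make the point concrete, here is a minimal sketch reusing the two vectors above. It illustrates that Pearson correlation is unchanged by a linear rescaling of the feature, whereas the variance is not, so any variance cutoff depends on the units the feature happens to be recorded in. The names `rescaled` and `standardized`, and the factor of 100, are illustrative choices and not part of the original example.

```r
independent_variable <- c(100000, 100000.01, 100000.02, 100000.03, 100000.04, 100000.05)
dependent_variable   <- c(1, 2, 3, 4, 5, 6)

# Hypothetical change of units: shift by the minimum and multiply by 100.
rescaled <- (independent_variable - min(independent_variable)) * 100

# Standardizing (z-scoring) forces the sample variance to 1 by construction.
standardized <- as.numeric(scale(independent_variable))

# The variance changes with the scale of measurement ...
print(var(independent_variable))   # tiny
print(var(rescaled))               # about 100^2 times larger, same information
print(var(standardized))           # 1 by construction

# ... while the correlation with the response does not.
print(cor(independent_variable, dependent_variable))
print(cor(rescaled, dependent_variable))
print(cor(standardized, dependent_variable))
```

If the three correlations agree while the three variances differ by orders of magnitude, the raw variance alone cannot be what determines how much of the response a feature explains.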

What am I missing here?

Amnon
  • You are absolutely right, low variance by itself is not a suitable criterion for picking variables, for exactly the reason you showed. My advice is to look for variables which are related to the prediction target, ideally by trying all subsets (yup) of variables with the model class you've chosen. Workable approximations include choosing variables one by one according to mutual information or simple correlation, and choosing groups of variables (e.g. all subsets of groups of variables, probably a much smaller number). Also, look for redundancies between inputs and try to omit redundant ones. – Robert Dodier Feb 03 '21 at 19:09
  • This is a fundamental question, but it's off topic for SO; try stats.stackexchange.com for further discussion. – Robert Dodier Feb 03 '21 at 19:09
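
The correlation-based screening and redundancy check suggested in the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are simulated, the feature names `x1` to `x4` are hypothetical, and the 0.95 redundancy cutoff is an arbitrary choice rather than a recommended value.

```r
set.seed(1)  # simulated data, for illustration only
n  <- 200
x1 <- rnorm(n)                        # informative feature
x2 <- x1 + rnorm(n, sd = 0.05)        # nearly redundant copy of x1
x3 <- rnorm(n)                        # unrelated noise
x4 <- 100000 + rnorm(n, sd = 0.01)    # low-variance feature ...
y  <- 2 * x1 + 200 * (x4 - 100000) + rnorm(n, sd = 0.5)  # ... that still drives y

features <- data.frame(x1, x2, x3, x4)

# Rank features by absolute Pearson correlation with the target.
relevance <- sort(abs(cor(features, y)[, 1]), decreasing = TRUE)
print(relevance)

# Flag redundant pairs: feature-feature correlation above the assumed cutoff.
feature_cor <- abs(cor(features))
redundant <- which(feature_cor > 0.95 & upper.tri(feature_cor), arr.ind = TRUE)
print(data.frame(a = colnames(features)[redundant[, "row"]],
                 b = colnames(features)[redundant[, "col"]]))

# For comparison: raw variances, where x4 is by far the smallest.
print(sapply(features, var))
```

In this simulated setup a plain variance cutoff would discard `x4` first, even though it is related to `y`, while the correlation ranking keeps it and the redundancy check singles out the `x1`/`x2` pair.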

0 Answers