20

Running a VarianceThreshold from scikit-learn on a set of data removes a couple of features. I feel I'm doing something simple yet stupid, but I'd like to retain the names of the remaining features. The following code:

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def VarianceThreshold_selector(data):
    selector = VarianceThreshold(.5)
    selector.fit(data)
    # transform() returns a bare array, so the column names are lost here
    return pd.DataFrame(selector.transform(data))

x = VarianceThreshold_selector(data)
print(x)

changes the following data (this is just a small subset of the rows):

Survived  Pclass  Sex  Age  SibSp  Parch  Nonsense
       0       3    1   22      1      0         0
       1       1    2   38      1      0         0
       1       3    2   26      0      0         0
into this (again just a small subset of the rows)

     0         1      2     3
0    3      22.0      1     0
1    1      38.0      1     0
2    3      26.0      0     0

Using the get_support method, I know that these are Pclass, Age, SibSp, and Parch, so I'd rather this return something more like:

   Pclass   Age  SibSp  Parch
0       3  22.0      1      0
1       1  38.0      1      0
2       3  26.0      0      0

Is there an easy way to do this? I'm very new to scikit-learn, so I'm probably just doing something silly.

  • Scikit itself doesn't support `pandas` data types with named columns and the like, so any time you use something like the `.transform` method of a scikit object, you're going to lose all that information. If you can track it separately (i.e., retrieve the column names as you describe), you can pass it back to specify the new column names after recreating a new DataFrame (see the sketch below). – BrenBarn Oct 02 '16 at 01:00
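
A minimal sketch of the approach BrenBarn describes, assuming `data` is the DataFrame from the question:

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(.5)
selector.fit(data)
kept = data.columns[selector.get_support()]  # names of the columns that passed the threshold
result = pd.DataFrame(selector.transform(data), columns=kept)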

6 Answers

40

Would something like this help? If you pass it a pandas DataFrame, it will get the columns and use get_support, as you mentioned, to pull out only the column headers that met the variance threshold.

>>> df
   Survived  Pclass  Sex  Age  SibSp  Parch  Nonsense
0         0       3    1   22      1      0         0
1         1       1    2   38      1      0         0
2         1       3    2   26      0      0         0

>>> from sklearn.feature_selection import VarianceThreshold
>>> def variance_threshold_selector(data, threshold=0.5):
...     selector = VarianceThreshold(threshold)
...     selector.fit(data)
...     return data[data.columns[selector.get_support(indices=True)]]

>>> variance_threshold_selector(df, 0.5)
   Pclass  Age
0       3   22
1       1   38
2       3   26
>>> variance_threshold_selector(df, 0.9)
   Age
0   22
1   38
2   26
>>> variance_threshold_selector(df, 0.1)
   Survived  Pclass  Sex  Age  SibSp
0         0       3    1   22      1
1         1       1    2   38      1
2         1       3    2   26      0
Jarad
  • could you please edit your answer? **selector.get_support(indices=True)** returns an array of indices. Thus, this line: **labels = [columns[x] for x in selector.get_support(indices=True) if x]** has a latent bug where column 0 will be skipped – tday03 May 24 '18 at 23:23
  • That looks correct! The columns variable is no longer used, but is irrelevant – tday03 May 25 '18 at 18:42
19

I came here looking for a way to get transform() or fit_transform() to return a data frame, but I suspect it's not supported.

However, you can subset the data a bit more cleanly like this:

data_transformed = data.loc[:, selector.get_support()]
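
For context, here's the full flow as a sketch (assuming `data` is the DataFrame from the question):

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(0.5)
selector.fit(data)

# get_support() returns a boolean mask over the columns, so .loc keeps
# both the surviving values and the original column names
data_transformed = data.loc[:, selector.get_support()]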
pteehan
6

There are probably better ways to do this, but for those interested, here's how I did it:

def VarianceThreshold_selector(data):

    #Select Model
    selector = VarianceThreshold(0) #Defaults to 0.0, i.e. only remove features with the same value in all samples

    #Fit the Model
    selector.fit(data)
    features = selector.get_support(indices = True) #returns an array of integer column indices for the retained features
    features = [column for column in data.columns[features]] #list of the retained features' names

    #Format and Return
    selector = pd.DataFrame(selector.transform(data))
    selector.columns = features
    return selector
  • We had basically the same idea with the exception of transform vs using fit_transform. Glad you figured it out. – Jarad Oct 02 '16 at 02:35
  • I'm a Python noob, but would it also be correct to do `features = data.columns.values[selector.get_support(indices = True)]`? I had trouble getting your approach to work with my data. – beldaz Nov 21 '17 at 07:38
  • Add columns to parse the columns: `features = [column for column in df_train.columns[features]]` – Nando Mar 11 '23 at 00:36
2

As I had some problems with Jarad's function, I combined it with pteehan's solution, which I found more reliable. I also added NA replacement as standard, since VarianceThreshold does not accept NA values.

def variance_threshold_select(df, thresh=0.0, na_replacement=-999):
    df1 = df.copy(deep=True) # Make a deep copy of the dataframe
    selector = VarianceThreshold(thresh)
    selector.fit(df1.fillna(na_replacement)) # Fill NA values, as VarianceThreshold cannot deal with those
    df2 = df.loc[:, selector.get_support(indices=False)] # Subset the original dataframe with the boolean mask of retained columns

    return df2
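
A quick usage sketch (hypothetical toy data, just to show the NA handling):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [0.0, 0.0, 0.0]})
print(variance_threshold_select(df))  # drops the zero-variance column 'b'; 'a' keeps its NaN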
Jan Janiszewski
1

How about this?

import statistics

low_var_cols = []

for col in df.columns:
    # statistics.variance is the sample variance, so results can differ
    # slightly from sklearn's VarianceThreshold, which uses the population variance
    if statistics.variance(df[col]) <= 0.1:
        low_var_cols.append(col)

then drop those columns from the dataframe, as shown below?
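
For completeness, the drop step could look like this (a sketch, continuing from the loop above):

df = df.drop(columns=low_var_cols)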

0

You can use pandas for the thresholding too (note this one thresholds on the standard deviation, not the variance):

data_new = data.loc[:, data.std(axis=0) > 0.75]
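
To match VarianceThreshold's semantics exactly, you could threshold on the variance instead; note that pandas defaults to the sample estimator (ddof=1), while scikit-learn uses the population variance, hence ddof=0 in this sketch:

data_new = data.loc[:, data.var(axis=0, ddof=0) > 0.5]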
SaTa