Feature selection using statistical model

Question

Problem statement:

I am working on a problem where I have to predict whether a customer will opt for a loan or not. I have converted all available data types (object, int) into integers, and now my data looks like below.

The highlighted column is my Target column where

0 means Yes

1 means No

There are 47 independent columns in this data set.

I want to do feature selection on these columns against my Target column.

I started with a Z-test:

import numpy as np
import scipy.stats as st
import scipy.special as sp


def feature_selection_pvalue(df,col_name,samp_size=1000):
    relation_columns=[]
    no_relation_columns=[]
    H0='There is no relation between target column and independent column'
    H1='There is a relation between target column and independent column'
    # draw a random sample of the column (differs on every call)
    sample = df[col_name].sample(samp_size)
    samp_mean = sample.mean()
    pop_mean=df[col_name].mean()
    pop_std=df[col_name].std()
    print (pop_mean)
    print (pop_std)
    print (samp_mean)
    n=samp_size
    q=.5
    #lets calculate z
    #z = (samp_mean - pop_mean) / np.sqrt(pop_std*pop_std/n)
    z = (samp_mean - pop_mean) / np.sqrt(pop_std*pop_std / n)
    print (z)
    # two-sided p-value; abs() keeps it in [0, 1] when z is negative
    pval = 2 * (1 - st.norm.cdf(abs(z)))
    print ('p values is==='+str(pval))
    if pval< .05 :
        print ('Null hypothesis is Accepted for col ---- >'+H0+col_name)

        no_relation_columns.append(col_name)
    else:
        print ('Alternate Hypothesis is accepted -->'+H1)
        relation_columns.append(col_name)
        print ('length of list ==='+str(len(relation_columns)))


    return relation_columns,no_relation_columns

When I run this function, I always get different results:

for items in df.columns:
    relation,no_relation=feature_selection_pvalue(df,items,5000)

My questions are:

Is the above z-test a reliable means of feature selection when the result differs each time?
What would be a better approach to feature selection in this case? If possible, please provide an example.

score 0 · Answer 1 · answered Sep 12 '19 at 06:07

What would be a better approach to feature selection in this case? If possible, please provide an example.

Are you able to use scikit-learn? It offers many examples and options for selecting features: https://scikit-learn.org/stable/modules/feature_selection.html

If we look at the first one (Variance threshold):

from sklearn.feature_selection import VarianceThreshold
X = df[['age', 'balance',...]] #select your columns
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_red = sel.fit_transform(X)

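This keeps only the columns whose variance is above the threshold (here .8 * (1 - .8) = 0.16), so constant and near-constant columns are dropped. For a 0/1 column, that means a column holding the same value in more than 80% of the rows is removed.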
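To make this concrete, here is a minimal runnable sketch on synthetic data. The column names (`constant`, `noise`, `signal`) and the generated values are made up for illustration and are not taken from your data set. Note that VarianceThreshold never looks at the target, so the sketch follows it with univariate selection from the same scikit-learn page, using SelectKBest with mutual_info_classif. That scorer and k=1 are assumptions chosen for the demo, not the only reasonable option.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import (
    SelectKBest,
    VarianceThreshold,
    mutual_info_classif,
)

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for the loan data (all column names are hypothetical).
target = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "constant": np.ones(n, dtype=int),                   # same value in every row
    "noise": rng.integers(0, 5, size=n),                 # generated independently of the target
    "signal": target * 3 + rng.integers(0, 2, size=n),   # built from the target
    "Target": target,
})
X = df.drop(columns="Target")
y = df["Target"]

# Step 1: drop columns whose variance is not above the threshold (0.16 here).
vt = VarianceThreshold(threshold=0.8 * (1 - 0.8))
vt.fit(X)
kept_by_variance = list(X.columns[vt.get_support()])


# Step 2: score the remaining columns against the target with mutual information.
# All columns are integer coded, so they are treated as discrete features.
def score(features, labels):
    return mutual_info_classif(
        features, labels, discrete_features=True, random_state=0
    )


skb = SelectKBest(score_func=score, k=1)
skb.fit(X[kept_by_variance], y)
kept_by_score = [
    col for col, keep in zip(kept_by_variance, skb.get_support()) if keep
]

print("kept after variance threshold:", kept_by_variance)
print("kept after univariate selection:", kept_by_score)
```

Because nothing here depends on a fresh random subsample per call (the seeds are fixed and the whole column is scored), rerunning it gives the same selection each time. On your data you would replace the synthetic DataFrame with your own and pick k to suit the 47 columns.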
