
Problem statement:

I am working on a problem where I have to predict whether a customer will opt for a loan or not. I have converted all columns (originally object and int) to integers, and my data now looks like this:

[image: view of data]

The highlighted column is my target column, where 0 means Yes and 1 means No.

There are 47 independent columns in this data set.

I want to do feature selection on these columns against my target column.

I started with a Z-test:

import numpy as np
import scipy.stats as st


def feature_selection_pvalue(df, col_name, samp_size=1000):
    relation_columns = []
    no_relation_columns = []
    H0 = 'There is no relation between target column and independent column'
    H1 = 'There is a relation between target column and independent column'
    # Draw a random sample of the column (a different one on every call)
    sample = df[col_name].sample(samp_size)
    samp_mean = sample.mean()
    pop_mean = df[col_name].mean()
    pop_std = df[col_name].std()
    print(pop_mean)
    print(pop_std)
    print(samp_mean)
    n = samp_size
    # z statistic: sample mean against the mean of the full column
    z = (samp_mean - pop_mean) / np.sqrt(pop_std * pop_std / n)
    print(z)
    # Two-sided p-value, so use the absolute value of z
    pval = 2 * (1 - st.norm.cdf(abs(z)))
    print('p value is === ' + str(pval))
    if pval < .05:
        # A small p-value means H0 is rejected
        print('Null hypothesis is rejected for col ----> ' + col_name + ': ' + H1)
        relation_columns.append(col_name)
        print('length of list === ' + str(len(relation_columns)))
    else:
        print('Failed to reject null hypothesis for col ----> ' + col_name + ': ' + H0)
        no_relation_columns.append(col_name)

    return relation_columns, no_relation_columns

When I run this function, I always get different results:

for items in df.columns:
    relation,no_relation=feature_selection_pvalue(df,items,5000)
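
One likely reason for the differing results is that `DataFrame.sample` draws a new random sample on every call, so the z statistic changes from run to run. Here is a minimal sketch of that effect, using synthetic data and a made-up column name `balance` (both are assumptions for illustration, not the data from the question). The helper `z_for_sample` is hypothetical as well:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one column of the data set (hypothetical values).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({"balance": rng.normal(1000, 250, size=20_000)})


def z_for_sample(col, samp_size, random_state=None):
    """z statistic of a random sample mean against the full-column mean."""
    sample = col.sample(samp_size, random_state=random_state)
    return (sample.mean() - col.mean()) / (col.std() / np.sqrt(samp_size))


# Without a seed, every call draws a different sample, so z changes.
z_unseeded = [z_for_sample(df_demo["balance"], 5000) for _ in range(3)]

# With a fixed seed, the same rows are drawn each time and z is identical.
z_seeded = [z_for_sample(df_demo["balance"], 5000, random_state=42) for _ in range(3)]

print(z_unseeded)
print(z_seeded)
```

Note that this statistic compares a column with a sample of itself and never looks at the target column, so fixing the seed makes the output repeatable but does not turn it into a measure of relation to the target.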

My questions are:

  1. Is the above Z-test a reliable means of feature selection when the result differs each time?
  2. What would be a better approach to feature selection in this case? If possible, please provide an example.
pankaj mishra

1 Answer


What would be a better approach in this case to do feature selection, if possible provide an example

Are you able to use scikit-learn? It offers many examples and options for selecting features: https://scikit-learn.org/stable/modules/feature_selection.html

If we look at the first one (Variance threshold):

from sklearn.feature_selection import VarianceThreshold
X = df[['age', 'balance',...]] #select your columns
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_red = sel.fit_transform(X)

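This keeps only the columns whose variance is above the threshold (0.16 here) and drops columns that are constant or nearly constant, for example a column that contains the same value in almost every row. Note that VarianceThreshold looks only at the features themselves, not at the target column.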
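
Since the question asks for selection against the target column, a target-aware method from the same scikit-learn page may be a closer fit. Below is a minimal sketch using `SelectKBest` with the `f_classif` score on synthetic data. The column names `informative`, `noise_a` and `noise_b`, the data itself, and the choice of `k=1` are assumptions for illustration, not taken from the question:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (hypothetical): one column depends on the target, two do not.
rng = np.random.default_rng(0)
n = 2000
target = rng.integers(0, 2, size=n)

X_demo = pd.DataFrame({
    "informative": target * 2.0 + rng.normal(0, 1, size=n),
    "noise_a": rng.normal(0, 1, size=n),
    "noise_b": rng.normal(0, 1, size=n),
})

# Score every column against the target with an ANOVA F-test, keep the best k.
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X_demo, target)

# get_support() returns a boolean mask over the input columns.
selected = list(X_demo.columns[selector.get_support()])

print(dict(zip(X_demo.columns, selector.scores_.round(2))))
print(selected)
```

On real data, `k` would need tuning, and other score functions listed on that page (for example `chi2` for non-negative features or `mutual_info_classif`) may suit integer-encoded categorical columns better than `f_classif`.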

PV8