-1

I am currently working on an auto deep learning project on python on a specific task: binary classification on tabular data. So i automated the preprocessing steps(handling missing data, encoding variables ..) to feed it to the neural network, but i don't know how to automate the search for the best architecture neural network. The code for my preprocessing steps is below:

import pandas as pd
pd.set_option('display.max_rows', None)
import numpy  as np
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
import category_encoders as ce

df=pd.read_csv(".\Taiwan_Credit_Card_Clients\default_of_credit_card_clients.csv")

#print(df.head())
#print(df.shape)

def column_types(data):
    #different columns of dataframe by dtype
    a=data.select_dtypes(include=['int64','float64']).columns
    numer_col=a.tolist()
    b=data.select_dtypes(include=['object','bool','category']).columns
    categ_col=b.tolist()
    c=data.select_dtypes(include=['datetime64','timedelta64']).columns
    date_col=c.tolist()
    output={}
    output["categorical_columns"]=categ_col
    output["numerical_columns"]=numer_col
    output["date_columns"]=date_col
    return output

#print(column_types(df))

def columns_to_drop(data):
    s=column_types(data)
    #determine columns to drop
    unique_val=data[s["numerical_columns"]].nunique()
    col_to_drop = unique_val.loc[unique_val.values==1].index.tolist()
    #remove columns that are unique to every datapoint (like id)
    for col in data.columns:
        if df.shape[0]==df[col].nunique():
            col_to_drop.append(col)
    if len(s["date_columns"])!=0:
        col_to_drop.append(s["date_columns"])
    return col_to_drop
#print(columns_to_drop(df))

#drop unnecessary columns
def drop_un_columns(data):
    data = data.drop(columns_to_drop(data),axis=1)
    return data
#a=drop_un_columns(df)
#print(a.head())

#print(df.isna().sum())

#impute missing values
def handle_miss(data):
    s=column_types(data)
    if data.isnull().values.any()==True:
        #impute mod mean
        # impute missing values in item weight by mean
        for col in s["numerical_columns"]:
            data[col].fillna(data[col].mean(),inplace=True)
        # impute outlet size in training data by mode
        for col in s["categorical_columns"]:
            data[col].fillna(data[col].mode()[0],inplace=True)
    return data

#non_miss_df=handle_miss(df)
#print(non_miss_df.head())
#print(non_miss_df.isna().sum())

#check imbalance in data:
def handle_imb_under(data,target):
    target_vals=list(data[target].value_counts().to_dict().keys())
    #create two different dataframe of majority and minority class 
    numb_1st_class=data[target].value_counts().to_dict()[target_vals[0]]
    numb_2nd_class=data[target].value_counts().to_dict()[target_vals[1]]
    #fix threshhold of 20% difference in the unbalance
    if abs(numb_1st_class-numb_2nd_class)>20:
        if numb_1st_class<=numb_2nd_class:
            df_majority = data[(data[target]==target_vals[0])]
            df_minority = data[(data[target]==target_vals[1])] 
        else:
            df_majority = data[(data[target]==target_vals[1])]
            df_minority = data[(data[target]==target_vals[0])] 
        # upsample minority class
        df_minority_upsampled = resample(df_minority, 
                                        replace=True,    # sample with replacement
                                        n_samples= len(df_majority), # to match majority class
                                        random_state=42)  # reproducible results
        # Combine majority class with upsampled minority class
        df_undersampled = pd.concat([df_minority_upsampled, df_majority])

    ##or use smote:
    #X, y = SMOTE().fit_resample(list(x_y(data,target)[0], list(x_y(data,target)[1])))
    #X_resampled, y_resampled = SMOTE().fit_resample(X, y)
    return df_undersampled

b=handle_imb_under(df,'default payment next month')
#print(b.shape)

#divide data into training and target
def x_y(data,target):
    X = data.loc[:, data.columns!=target]
    y = data[[target]]
    return (X,y)

a=x_y(b,'default payment next month')
#print(a[0].shape)
#print(a[1].shape)

#print(a[1].value_counts())


#divide data to train and validation:
def train_val(x,y):
    X_train, X_val, y_train, y_val = train_test_split(x, y, 
                                                  test_size=0.2, 
                                                  random_state=42, 
                                                  shuffle=True)
    return(X_train, X_val, y_train, y_val)

c=train_val(a[0],a[1])
#print(c[0].head())
#print(c[0].shape)
#print(c[1].head())
#print(c[1].shape)
#print(c[2].head())
#print(c[2].shape)
#print(c[3].head())
#print(c[3].shape)
##################perform minmaxscaler on each column
#apply minmaxscaler on integer features
def minmaxscaler(xtrain,xval):
    xtrain.reset_index(drop=True,inplace=True)
    xval.reset_index(drop=True,inplace=True)
    cols=column_types(xtrain)    
    num_cols=cols["numerical_columns"]   
    cat_cols=cols["categorical_columns"]   
    scaler  = MinMaxScaler()
    X_train_cat=xtrain[cat_cols]
    X_train_num = scaler.fit_transform(xtrain[num_cols])
    X_train_num=pd.DataFrame(X_train_num,columns=num_cols)
    X_train=pd.concat([X_train_num,X_train_cat],axis=1)
    X_val_num  = scaler.transform(xval[num_cols])
    X_val_cat=xval[cat_cols]
    X_val_num=pd.DataFrame(X_val_num,columns=num_cols)
    X_val=pd.concat([X_val_num,X_val_cat],axis=1)
    return X_train,X_val

print(minmaxscaler(c[0],c[1])[0].head())
print(minmaxscaler(c[0],c[1])[1].head())
x_train_scaled=minmaxscaler(c[0],c[1])[0]
x_val_scaled=minmaxscaler(c[0],c[1])[1]


def encode(a):
    le = LabelEncoder()
    le.fit(a)
    le.transform(a)

def lab_encode(ytrain,yval,target):
    ytrain_encoder=ytrain
    yval_encoder=yval
    if ytrain[target].dtype not in ['int64','float64']:
        le = LabelEncoder()
        le.fit(ytrain)
        ytrain_encoder=le.transform(ytrain)
        ytrain_encoder=pd.DataFrame(ytrain_encoder,columns=[target])
        yval_encoder=le.transform(yval)
        yval_encoder=pd.DataFrame(yval_encoder,columns=[target])
    else:
        print('////////////////')
    return ytrain_encoder,yval_encoder

#print(lab_encode(c[2],c[3],'default payment next month'))

def encode_cat(xtrain,xval):
    # create an object of the OneHotEncoder
    s=column_types(xtrain)
    OHE = ce.OneHotEncoder(cols=s["categorical_columns"],use_cat_names=True)
    # encode the categorical variables
    xtrain_encoder = OHE.fit_transform(xtrain)
    xval_encoder=OHE.transform(xval)
    return xtrain_encoder, xval_encoder

also i wanna know guys what do you think about the preprocessing steps i did? is there any improvements i can make?

PS: the dataset i am using is this : https://www.kaggle.com/datasets/jishnukoliyadan/taiwan-default-credit-card-clients just to get me started

1 Answers1

1

I have a conflict of interest here, but I think the tool I wrote, Deep Neural Network Doodler, is exactly what you’re looking for if you want something quick: you specify the prediction task, and it generates both the architecture and the neural network. Technically, it does regression, but that’s easy to adapt for classification. For a more heavyweight solution, check out DeepHyper from Argonne. Finally, GitHub has a whole page devoted to Neural Architecture Search. The downside is many of these packages are slow. Just know that none of these solutions is likely to find the very best architecture.

C deeply
  • 11
  • 4
  • Normally, just pointing to off-site resources is considered a link-only NAA ("not an answer"), which typically would be deleted with the possibility of being undeleted if [edit]ed to expand it. In general, with something like this, we'd like to see at least a description of how to use the suggested tools to solve the question which was asked. In this case, the question is both unclear and too broad, so I'm going to leave this and hope that the question author comes back and edits their question. – Makyen Aug 21 '22 at 03:18
  • 1
    As to your conflict of interest, thank you for mentioning it. [We have rules](/help/promotion) that when linking to, recommending, or suggesting something you are affiliated with you must disclose that affiliation in the post, but the disclosure doesn't need to be formal. It just needs to be clear and in the post. What you had might not have been understood by everyone reading this, so I edited to make your affiliation clearer. – Makyen Aug 21 '22 at 03:22