
I have a dataframe ready for modelling; it contains continuous variables and one-hot-encoded variables:

ID   Limit   Bill_Sep  Bill_Aug  Payment_Sep   Payment_Aug   Gender_M   Gender_F  Edu_Uni DEFAULT_PAYMT
1    10000   2000      350       1000          350           1          0         1          1
2    30000   3000      5000      500           500           0          1         0          0
3    20000   8000      10000     8000          5000          1          0         1          1
4    45000   450       250       450           250           0          1         0          1
5    60000   700       1000      700           1000          1          0         1          1
6    8000    300       5000      300           2000          1          0         1          0
7    30000   3000      10000     1000          5000          0          1         1          1
8    15000   1000      1250      500           1750          0          1         1          1

All the continuous variables are 'int64' while the one-hot-encoded variables are 'uint8'. The binary outcome variable is DEFAULT_PAYMT.

I have gone down the usual route of a train/test split here, but I wanted to see if I could apply the StandardScaler only to the int64 variables (i.e., the variables that were not one-hot-encoded):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

featurelist = df.drop(['ID', 'DEFAULT_PAYMT'], axis=1)
X = featurelist
y = df['DEFAULT_PAYMT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

I am attempting the following code and it seems to work; however, I am not sure how to merge the categorical variables (which were not scaled) back into the X_scaled_tr and X_scaled_t arrays. I'd appreciate any help, thank you!

featurelist = df.drop(['ID','DEFAULT_PAYMT'],axis = 1)
X = featurelist
y = df['DEFAULT_PAYMT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

sc = StandardScaler()
X_scaled_tr = X_train.select_dtypes(include=['int64'])
X_scaled_t = X_test.select_dtypes(include=['int64'])

X_scaled_tr = sc.fit_transform(X_scaled_tr)
X_scaled_t = sc.transform(X_scaled_t)
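One way to finish the approach above is to scale only the int64 block and then stitch the untouched uint8 block back on with np.hstack. A minimal sketch on toy data mirroring the layout above (the values are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame with the same dtype split as the question (values illustrative)
df = pd.DataFrame({
    'Limit':    [10000, 30000, 20000, 45000],
    'Bill_Sep': [2000, 3000, 8000, 450],
    'Gender_M': np.array([1, 0, 1, 0], dtype='uint8'),
    'DEFAULT_PAYMT': [1, 0, 1, 1],
})
X = df.drop(['DEFAULT_PAYMT'], axis=1)
y = df['DEFAULT_PAYMT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Select the continuous (int64) and one-hot (uint8) column sets by dtype
cont_cols = X_train.select_dtypes(include=['int64']).columns
cat_cols = X_train.select_dtypes(include=['uint8']).columns

# Scale only the continuous block, then paste the untouched
# categorical block back on (column order: continuous, then categorical)
sc = StandardScaler()
X_train_final = np.hstack([sc.fit_transform(X_train[cont_cols]),
                           X_train[cat_cols].to_numpy()])
X_test_final = np.hstack([sc.transform(X_test[cont_cols]),
                          X_test[cat_cols].to_numpy()])
```

Note that the result is a plain NumPy array, so the original column names are lost; keep `cont_cols` and `cat_cols` around if you need to map columns back.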
  • Why don't you do the scaling before one-hot encoding? Split your DataFrame to get two dfs: one with the columns for scaling, and one with the columns for OHE. Scale the columns you wish and then apply OHE to the others. The result will be of type numpy array, so you will have to transform both into DataFrames; then do a concat and adjust the column names if necessary. – Catalina Chircu Mar 28 '20 at 13:41
  • I read somewhere that dummy variables should not be standardised. – wjie08 Mar 28 '20 at 13:55
  • I never said that; read my post carefully: I said you should standardize separately only the columns which are NOT to be processed with OHE (one-hot encoding). – Catalina Chircu Mar 28 '20 at 14:47
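The comment's suggestion (split the frame into a continuous block and a categorical block, scale one, one-hot encode the other, then concat the DataFrames back together) can be sketched as follows, again on illustrative toy data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame (values illustrative): one continuous block, one raw categorical
df = pd.DataFrame({
    'Limit':    [10000, 30000, 20000],
    'Bill_Sep': [2000, 3000, 8000],
    'Gender':   ['M', 'F', 'M'],
})

cont = df[['Limit', 'Bill_Sep']]
cat = pd.get_dummies(df[['Gender']])  # one-hot encode separately

# Scale the continuous block and wrap it back into a DataFrame
# so column names and the index survive
sc = StandardScaler()
cont_scaled = pd.DataFrame(sc.fit_transform(cont),
                           columns=cont.columns, index=cont.index)

# Concat the scaled and one-hot blocks side by side
df_ready = pd.concat([cont_scaled, cat], axis=1)
```

One caveat: in practice the scaler should be fit on the training split only (as the question's code already does) rather than on the full frame before splitting, to avoid leaking test-set statistics.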

1 Answer


I managed to address the question with the following code, where StandardScaler is applied only to the continuous variables and NOT to the one-hot-encoded variables:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Scale only the listed continuous columns; pass the one-hot columns through untouched
ct = ColumnTransformer(
    [('scaler', StandardScaler(), ['Limit', 'Bill_Sep', 'Bill_Aug', 'Payment_Sep', 'Payment_Aug'])],
    remainder='passthrough')

X_train_scaled = ct.fit_transform(X_train)
X_test_scaled = ct.transform(X_test)