I'm working with a model, and after splitting into train and test, I want to apply StandardScaler(). However, this transformation converts my data into an array and I want to keep the format I had before. How can I do this?

Basically, I have:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = df[features]
y = df[["target"]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

How can I get X_train_sc back to the format that X_train had?

Update: I don't want to get X_train_sc to reverse back to before being scaled. I just want X_train_sc to be a dataframe in the easiest possible way.

FBruzzesi
dmmmmd
  • There should be an `inverse_transform` method for the `StandardScaler` that takes you back. – Sia Oct 01 '20 at 18:45
  • The inverse_transform changes the data back to what it was before being scaled. I don't want that, I just want X_train_sc to be in the same format as X_train – dmmmmd Oct 01 '20 at 18:52
  • What do you mean by *same format*? – Quang Hoang Oct 01 '20 at 18:53
  • After applying StandardScaler(), I lose track of the variable names. The result is an array without the column names. I just want a dataframe like X_train was – dmmmmd Oct 01 '20 at 18:55
  • Something to realize is that `X_train` is not scaled yet. You are using `fit_transform` that completes two tasks for the data in one step. You should use `fit` separately to keep track of the variables, then apply `transform` in a different step. – Sia Oct 01 '20 at 18:57

1 Answer

As you mentioned, applying the scaler returns a numpy array. To get a dataframe, you can initialize a new one from that array:

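import pandas as pd

cols = X_train.columns
sc = StandardScaler()
# Pass the original index too, so the scaled rows still align with y_train / y_test
X_train_sc = pd.DataFrame(sc.fit_transform(X_train), columns=cols, index=X_train.index)
X_test_sc = pd.DataFrame(sc.transform(X_test), columns=cols, index=X_test.index)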
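For reference, here is a minimal self-contained sketch of the same idea. The toy `df` and its column names `a`, `b` are made up for illustration and stand in for the data in the question. It also passes the original index to the new dataframe, since `train_test_split` shuffles the rows and a freshly built dataframe would otherwise get a default 0..n-1 index that no longer matches `y_train`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the question's df
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "b": [10.0, 30.0, 20.0, 50.0, 40.0, 70.0, 60.0, 90.0, 80.0, 100.0],
    "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
features = ["a", "b"]

X = df[features]
y = df[["target"]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

sc = StandardScaler()
# Wrap the numpy output back into a dataframe, reusing column names and row index
X_train_sc = pd.DataFrame(
    sc.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_sc = pd.DataFrame(
    sc.transform(X_test), columns=X_test.columns, index=X_test.index
)
```

Keeping the index means something like `X_train_sc.join(y_train)` still lines up row by row.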

2022 Update

As of scikit-learn version 1.2.0, it is possible to use the set_output API to configure transformers to output pandas DataFrames (check the doc example).

The above example would simplify as follows:

sc = StandardScaler().set_output(transform="pandas")
X_train_sc = sc.fit_transform(X_train)  # DataFrame with X_train's columns and index
X_test_sc = sc.transform(X_test)
FBruzzesi